Abstract

We study three families of online convex optimization algorithms: follow-the-proximally-
regularized-leader (FTRL-Proximal), regularized dual averaging (RDA), and composite-objective
mirror descent. We first prove equivalence theorems that show all of these algorithms are
instantiations of a general FTRL update. This provides theoretical insight on previous
experimental observations. In particular, even though the FOBOS composite mirror descent
algorithm handles $L_1$ regularization explicitly, it has been observed that RDA is even more
effective at producing sparsity. Our results demonstrate that FOBOS uses subgradient
approximations to the $L_1$ penalty from previous rounds, leading to less sparsity than RDA,
which handles the cumulative penalty in closed form. The FTRL-Proximal algorithm can be seen
as a hybrid of these two, and outperforms both on a large, real-world dataset.

Our second contribution is a unified analysis which produces regret bounds that
match (up to logarithmic terms) or improve the best previously known bounds. This
analysis also extends these algorithms in two important ways: we support a more
general type of composite objective and we analyze implicit updates, which replace the
subgradient approximation of the current loss function with an exact optimization.

Keywords: online learning, online convex optimization, subgradient methods, regret
bounds, follow-the-leader algorithms

1 Introduction

We consider the problem of online convex optimization, and in particular its application to
online learning. On each round $t = 1, \ldots, T$, we must pick a point $x_t \in \mathbb{R}^n$.
A convex loss function $f_t$ is then revealed, and we incur loss $f_t(x_t)$. Our regret at the
end of $T$ rounds with respect to a comparator point $\mathring{x}$ is

\[ \text{Regret} = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(\mathring{x}). \]
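The regret definition above can be made concrete with a small sketch. The function name `regret` and the one-dimensional squared losses below are illustrative assumptions, not part of the paper; the code only evaluates the two sums in the definition for a fixed comparator.

```python
from typing import Callable, Sequence

Loss = Callable[[float], float]


def regret(losses: Sequence[Loss], plays: Sequence[float], comparator: float) -> float:
    """Cumulative loss of the played points minus that of a fixed comparator."""
    incurred = sum(f(x) for f, x in zip(losses, plays))
    baseline = sum(f(comparator) for f in losses)
    return incurred - baseline


# Hypothetical example: three squared losses f_t(x) = (x - c_t)^2.
targets = [1.0, 2.0, 3.0]
losses = [lambda x, c=c: (x - c) ** 2 for c in targets]
plays = [0.0, 1.0, 2.0]  # x_1, x_2, x_3, each chosen before f_t is revealed
example_regret = regret(losses, plays, comparator=2.0)
```

In this toy run the player always lags one step behind the targets, so it pays a loss of 1 on every round, while the fixed comparator 2.0 pays 1, 0, and 1.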

In Section 4 we provide a unified regret analysis of three prominent algorithms for online
convex optimization. In recent years, these algorithms have received significant attention
because they have straightforward and efficient implementations and offer state-of-the-art
performance for many large-scale applications. In particular, we consider:

• Follow-the-Proximally-Regularized-Leader (FTPRL), introduced with adaptive learning
rates (regularization) by McMahan and Streeter (2010).

• Regularized Dual Averaging (RDA), introduced by Xiao (2009) and extended with
adaptive learning rates by Duchi et al. (2010a).

• Composite-Objective Mirror Descent (COMID) algorithms (Duchi et al., 2010b),
including FOBOS (Duchi and Singer, 2009).



As pointed out by Duchi et al. (2010b), the analyses of RDA and COMID cited above
are completely different. In contrast, we provide a unified analysis of these algorithms.
One of our contributions is simply demonstrating that this large and important family of
algorithms can be analyzed using a common argument, but our analysis also generalizes
previous results in several important ways. First, we extend all of these algorithms to handle
implicit updates, which replace the first-order approximation of the current loss function
with an exact optimization. In many practical situations this update can be solved efficiently,
and it offers both theoretical and practical benefits compared to the first-order update.

We also extend the ability of these algorithms to handle composite objectives (objectives
that include a fixed non-smooth term $\Psi$). Previous work considers loss functions on each
round of the form $f_t(x) + \Psi(x)$, where $f_t$ is approximated by a linear function, but the
optimization over $\Psi$ is exact. However, as discussed below, continuing to add a new copy
of $\Psi(x)$ on each round may be undesirable in some cases; to address this, we analyze loss
functions of the form $f_t(x) + \alpha_t \Psi(x)$, where $\alpha_t$ is a non-increasing sequence of
non-negative numbers. This is useful, for example, if one wishes to encode a Bayesian prior in
the online setting (see Section 2.2). Our proof technique has the advantage that handling this
general form of composite updates requires only a few extra lines beyond the non-composite proof.



The original analysis of FTPRL by McMahan and Streeter (2010) did not support composite
updates. In addition to remedying this, we prove a new, stronger version of the "FTRL/BTL
Lemma" which tightens the analysis of FTPRL by a constant factor. The new lemma is
quite general and may be of independent interest.

Our unified analysis relies on a formulation of all of these algorithms as instances of
follow-the-regularized-leader, which we develop in Section 3. A preliminary version of these
equivalence results appeared in (McMahan, 2010). Our equivalence theorems apply to
algorithms that use arbitrary strongly convex regularization; however, these results show that
the most interesting strict equivalences occur in the case of quadratic regularization. Thus,
for the analysis of Section 4 we restrict attention to this case, namely to algorithms where
the incremental strong convexity is of the form

\[ R_t(x) = \tfrac{1}{2} \bigl\| Q_t^{1/2} (x - y) \bigr\|_2^2, \]

where $y \in \mathbb{R}^n$ and $Q_t$ is a positive-semidefinite matrix. This is less general than
previous results stated in terms of arbitrary strongly convex functions or Bregman divergences.



Application to Sparse Models via $L_1$ Regularization On the surface, follow-the-
regularized-leader algorithms like regularized dual averaging (Xiao, 2009) appear quite
different from gradient descent (and more generally, mirror descent) style algorithms like
FOBOS (Duchi and Singer, 2009). However, the results of Section 3 show that in the case of
quadratic stabilizing regularization there are only two differences between the algorithms:

• How they choose to center the additional strong convexity used to guarantee low regret:
RDA centers this regularization at the origin, while FOBOS centers it at the current
feasible point.

• How they handle an arbitrary non-smooth regularization function $\Psi$. This includes
the mechanism of projection onto a feasible set and how $L_1$ regularization is handled.



To make these differences precise while also illustrating that these families are actually
closely related, we consider a third algorithm, FTRL-Proximal. When the non-smooth term
$\Psi$ is omitted, this algorithm is in fact identical to FOBOS. On the other hand, its update
is essentially the same as that of dual averaging, except that additional strong convexity is
centered at the current feasible point (see Table 1).

Previous work has shown experimentally that dual averaging with $L_1$ regularization is
much more effective at introducing sparsity than FOBOS (Xiao, 2009; Duchi et al., 2010a).
Our equivalence theorems provide a theoretical explanation for this: while RDA considers
the cumulative $L_1$ penalty $t\lambda \|x\|_1$ on round $t$, FOBOS (when viewed as a global
optimization using our equivalence theorem) considers $\phi_{1:t-1} \cdot x + \lambda \|x\|_1$,
where $\phi_s$ is a certain subgradient approximation of $\lambda \|x_s\|_1$ (we use
$\phi_{1:t-1}$ as shorthand for $\sum_{s=1}^{t-1} \phi_s$, and extend the notation to sums over
matrices and functions as needed).

An experimental comparison of FOBOS, RDA, and FTRL-Proximal, presented in
Section 5, demonstrates the validity of the above explanation. The FTRL-Proximal algorithm
behaves very similarly to RDA in terms of sparsity, confirming that it is the cumulative
subgradient approximation to the $L_1$ penalty that causes decreased sparsity in FOBOS.

In recent years, online gradient descent and stochastic gradient descent (its batch
analogue) have proven themselves to be excellent algorithms for large-scale machine learning.
In the simplest case FTRL-Proximal is identical, but when $L_1$ or other non-smooth
regularization is needed, FTRL-Proximal significantly outperforms FOBOS, and can outperform
RDA as well. Since the implementations of FTRL-Proximal and RDA only differ by a few
lines of code, we recommend trying both and picking the one with the best performance in
practice.



2 Algorithms and Regret Bounds 

We begin by establishing notation and introducing more formally the algorithms we
consider. We consider loss functions $f_t(x) + \alpha_t \Psi(x)$, where $\Psi$ is a fixed
(typically non-smooth) regularization function. In a typical online learning setting, given an
example $(\theta_t, y_t)$ where $\theta_t \in \mathbb{R}^n$ is a feature vector and
$y_t \in \{-1, 1\}$ is a label, we take $f_t(x) = \text{loss}(\theta_t \cdot x, y_t)$. For
example, for logistic regression we use log-loss,
$\text{loss}(\theta_t \cdot x, y_t) = \log(1 + \exp(-y_t\, \theta_t \cdot x))$. All of
the algorithms we consider support composite updates (consideration of $\Psi$ explicitly rather
than through a gradient $\nabla f_t(x_t)$) as well as positive semi-definite matrix learning
rates $Q$ which can be chosen adaptively (the interpretation of these matrices as learning rates
will be clarified in Section 3).

We first consider the specific algorithms used in the $L_1$ experiments of Section 5; we
use the standard reduction to linear functions, letting $g_t = \nabla f_t(x_t)$. The first
algorithm we consider is from the gradient-descent family, namely FOBOS, which plays

\[ x_{t+1} = \arg\min_x \; g_t \cdot x + \lambda \|x\|_1 + \tfrac{1}{2} \bigl\| Q_{1:t}^{1/2} (x - x_t) \bigr\|_2^2. \]
Algorithm   (A)                                          (B)                                             (C)

COMID       $\arg\min_x \; g'_{1:t-1} \cdot x + f_t(x)$   $+\; \phi_{1:t-1} \cdot x + \alpha_t \Psi(x)$   $+\; \tfrac{1}{2} \sum_{s=1}^{t} \| Q_s^{1/2} (x - x_s) \|_2^2$

RDA         $\arg\min_x \; g'_{1:t-1} \cdot x + f_t(x)$   $+\; \alpha_{1:t} \Psi(x)$                      $+\; \tfrac{1}{2} \sum_{s=1}^{t} \| Q_s^{1/2} (x - 0) \|_2^2$

FTPRL       $\arg\min_x \; g'_{1:t-1} \cdot x + f_t(x)$   $+\; \alpha_{1:t} \Psi(x)$                      $+\; \tfrac{1}{2} \sum_{s=1}^{t} \| Q_s^{1/2} (x - x_s) \|_2^2$

AOGD        $\arg\min_x \; g'_{1:t-1} \cdot x + f_t(x)$   $+\; \phi_{1:t-1} \cdot x + \alpha_t \Psi(x)$   $+\; \tfrac{1}{2} \sum_{s=1}^{t} \| Q_s^{1/2} (x - 0) \|_2^2$

Table 1: The algorithms considered in this paper, expressed as particular instances of the
update of Eq. (3). The fact that we can express COMID and adaptive online gradient descent
(AOGD) in this way is a consequence of Theorems 4 and 7. Each algorithm's objective has
three components: (A) An approximation to the sum of previous loss functions $f_{1:t}$, where
the first $t - 1$ functions are approximated by linear terms, and $f_t$ is included exactly
(exactly including $f_t$ makes the updates implicit). (B) Terms for the non-smooth composite
terms $\alpha_t \Psi$. COMID approximates the terms for $\alpha_{1:t-1} \Psi$ by subgradients,
while RDA and FTPRL consider them exactly. And finally, (C), stabilizing regularization needed
to ensure low regret.



We state this algorithm implicitly as an optimization, but a gradient-descent style closed-
form update can also be given (Duchi and Singer, 2009). The algorithm was described in
this form as a specific composite-objective mirror descent (COMID) algorithm by Duchi
et al. (2010b).



The regularized dual averaging (RDA) algorithm of Xiao (2009) plays

\[ x_{t+1} = \arg\min_x \; g_{1:t} \cdot x + t\lambda \|x\|_1 + \tfrac{1}{2} \sum_{s=1}^{t} \bigl\| Q_s^{1/2} (x - 0) \bigr\|_2^2. \]



In contrast to FOBOS, the RDA optimization is over the sum $g_{1:t}$ rather than just the most
recent gradient $g_t$. We will show (in Theorem 7) that when $\lambda = 0$ and the $f_t$ are not
strongly convex, this algorithm is in fact equivalent to the adaptive online gradient descent
(AOGD) algorithm of Bartlett et al. (2007).



RDA is directly defined as an FTRL algorithm, and hence is also an instance of the
more general primal-dual algorithmic schema of Shalev-Shwartz and Singer (2006); see also
Kakade et al. (2009). However, these general results are not sufficient to prove the original
bounds for RDA, nor the versions here that extend to implicit updates.
The FTRL-Proximal algorithm plays

\[ x_{t+1} = \arg\min_x \; g_{1:t} \cdot x + t\lambda \|x\|_1 + \tfrac{1}{2} \sum_{s=1}^{t} \bigl\| Q_s^{1/2} (x - x_s) \bigr\|_2^2. \]

This algorithm was introduced by McMahan and Streeter (2010), but without support for
an explicit $\Psi$.
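To make the three $L_1$ updates concrete, here is a minimal sketch under the simplifying assumption (not made in the text) that every learning-rate matrix is a scalar multiple of the identity, $Q_s = \sigma_s I$, so each argmin separates by coordinate and reduces to soft-thresholding. The function names and the `centers` argument, which stores $\sum_s \sigma_s x_s$, are hypothetical.

```python
from typing import List


def _shrink(z: float, threshold: float) -> float:
    """Soft-threshold: move z toward zero by `threshold`, clamping at zero."""
    if z > threshold:
        return z - threshold
    if z < -threshold:
        return z + threshold
    return 0.0


def fobos_step(x_t: List[float], g_t: List[float], lam: float, sigma_sum: float) -> List[float]:
    """argmin_x g_t.x + lam*||x||_1 + (sigma_sum/2)*||x - x_t||^2."""
    return [_shrink(x - g / sigma_sum, lam / sigma_sum) for x, g in zip(x_t, g_t)]


def rda_step(g_sum: List[float], t: int, lam: float, sigma_sum: float) -> List[float]:
    """argmin_x g_{1:t}.x + t*lam*||x||_1 + (sigma_sum/2)*||x||^2."""
    return [-_shrink(g, t * lam) / sigma_sum for g in g_sum]


def ftrl_proximal_step(g_sum: List[float], centers: List[float], t: int,
                       lam: float, sigma_sum: float) -> List[float]:
    """argmin_x g_{1:t}.x + t*lam*||x||_1 + (1/2)*sum_s sigma_s*||x - x_s||^2.

    centers[i] holds sum_s sigma_s * x_{s,i}; expanding the squares shows the
    proximal terms only shift the linear coefficient from g_sum to g_sum - centers.
    """
    return [-_shrink(g - c, t * lam) / sigma_sum for g, c in zip(g_sum, centers)]
```

In this scalar setting RDA and FTRL-Proximal compare a cumulative linear term against the cumulative threshold $t\lambda$, whereas the FOBOS step thresholds only by the current round's $\lambda / \sigma_{1:t}$, which mirrors the sparsity discussion in the introduction.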

One of our principal contributions is showing the close connection between all four of
these algorithms; Table 1 summarizes the key results from Theorems 4 and 7, writing AOGD
and FOBOS in a form that makes the relationship to RDA and FTRL-Proximal explicit.
In our equivalence analysis, we will consider arbitrary convex functions $R_t$ and
$\tilde{R}_t$ in place of the $\tfrac{1}{2} \| Q_t^{1/2} x \|_2^2$ and
$\tfrac{1}{2} \| Q_t^{1/2} (x - x_t) \|_2^2$ that appear here, as well as arbitrary convex
$\Psi$ in place of $\lambda \|x\|_1$.

2.1 Implicit and Composite Updates for FTRL 

The algorithms we consider can be expressed as follow-the-regularized-leader (FTRL)
algorithms that perform implicit and composite updates. The standard subgradient FTRL
algorithm uses the update

\[ x_{t+1} = \arg\min_x \; \Bigl( \sum_{s=1}^{t} \nabla f_s(x_s) \Bigr) \cdot x + R_{1:t}(x). \]

In this update, each previous (potentially non-linear) loss function $f_s$ is approximated by
the gradient at $x_s$ (when $f_s$ is not differentiable, we can use a subgradient at $x_s$ in
place of the gradient). The functions $R_t$ are incremental regularization added on each round;
for example, $R_{1:t}(x) = \sqrt{t}\, \|x\|_2^2$ is a standard choice, corresponding to
regularized dual averaging.

Implicit update rules are usually defined for mirror descent algorithms, but we can define
an analogous update for FTRL:

\[ x_{t+1} = \arg\min_x \; \Bigl( \sum_{s=1}^{t-1} \nabla f_s(x_{s+1}) \Bigr) \cdot x + f_t(x) + R_{1:t}(x). \]

This update replaces the subgradient approximation of $f_t$ with the possibly non-linear $f_t$.



Closed-form implicit updates for the squared error case were derived by Kivinen and
Warmuth (1997); the term implicit updates was coined later (Kivinen et al., 2006). Our
formulation is similar to the online coordinate-dual-ascent algorithm briefly mentioned by
Shalev-Shwartz and Kakade (2008). In general, computing the implicit update might require
solving an arbitrary convex optimization problem (hence the name implicit); however, in many
useful applications it can be computed in closed form or by optimizing a one-dimensional
problem. We discuss the advantages of implicit updates in Section 2.2.



Analysis of implicit updates has proved difficult. Kulis and Bartlett (2010) provide
the only other regret bounds for implicit updates that match those of the explicit-update
versions. While their analysis handles more general divergences, it only applies to mirror-
descent algorithms. Our analysis handles composite objectives and applies to FTRL algorithms
as well as mirror descent. Our analysis also quantifies the one-step improvement in the
regret bound obtained by the implicit update, showing the inequality is in fact strict when
the implicit update is non-trivial.

When $f_t$ is not differentiable, we use the update

\[ x_{t+1} = \arg\min_x \; g'_{1:t-1} \cdot x + f_t(x) + R_{1:t}(x), \tag{1} \]

where $g'_t$ is a subgradient of $f_t$ at $x_{t+1}$ (that is, $g'_t \in \partial f_t(x_{t+1})$)
such that $g'_{1:t-1} + g'_t + \nabla R_{1:t}(x_{t+1}) = 0$. The existence of such a subgradient
is proved below, in Theorem 3.

In many applications, we have a fixed convex function $\Psi$ that we also wish to include
in the optimization, for example $\Psi(x) = \|x\|_1$ ($L_1$-regularization to induce sparsity)
or the indicator function on a feasible set $\mathcal{F}$ (see Section 2.4). While it is
possible to approximate this function via subgradients as well, when computationally feasible
it is often better to handle $\Psi$ directly. For example, in the case where
$\Psi(x) = \|x\|_1$, subgradient approximations will in general not lead to sparse solutions.
In this case, closed-form updates for optimizations including $\Psi$ are often possible, and
produce much better sparsity (Xiao, 2009; Duchi and Singer, 2009). We can include such a term
directly in FTRL, giving the composite objective update

\[ x_{t+1} = \arg\min_x \; \Bigl( \sum_{s=1}^{t} \nabla f_s(x_s) \Bigr) \cdot x + \alpha_{1:t} \Psi(x) + R_{1:t}(x), \tag{2} \]

where $\alpha_t$ is the weight on $\Psi$ on round $t$.

This formulation, which allows for an arbitrary sequence of non-negative, non-increasing
$\alpha_t$'s, is more general than that supported by the original analysis of COMID or RDA. Xiao
(2010 Sec 6.1) shows that RDA does allow a varying schedule where a\. t = c + \ j \ft for 



a constant $c$, by incorporating part of the $\Psi$ term in the regularization function $R_t$;
this is less general than our analysis, which allows the schedule $\alpha_t$ to be chosen
independently of the learning rate.

Finally, we can combine these ideas to define an implicit update with a composite
objective. In the general case where $f_t$ is not differentiable, we have the update

\[ x_{t+1} = \arg\min_x \; g'_{1:t-1} \cdot x + f_t(x) + \alpha_{1:t} \Psi(x) + R_{1:t}(x), \tag{3} \]

where $g'_t \in \partial f_t(x_{t+1})$ such that
$\exists\, \phi_t \in \partial(\alpha_{1:t} \Psi)(x_{t+1})$ where
$g'_{1:t-1} + g'_t + \phi_t + \nabla R_{1:t}(x_{t+1}) = 0$.
The existence of such a subgradient again follows from Theorem 3.

It is worth noting that our analysis of implicit updates applies immediately to standard
first-order updates. Let $f_t^{R}$ designate the loss function provided by the world, and let
$f_t^{U}$ be the loss function used in the update Eq. (3). Then we recover the non-implicit
algorithms by taking $f_t^{U}(x) \leftarrow \nabla f_t^{R}(x_t) \cdot x$.



2.2 Motivation for Implicit Updates and Composite Objectives 

Implicit updates offer a number of advantages over using a subgradient approximation.
Kulis and Bartlett (2010) discuss several important examples. They also observe that
empirically, implicit updates outperform or nearly outperform linearized updates, and show
more robustness to scaling of the data.

Learning problems that use importance weights on examples are also a good candidate
for implicit updates. Importance weights can be used to compress the training data, by
replacing $n$ copies of an example with one copy with weight $n$. They also arise in active
learning algorithms (Beygelzimer et al., 2010) and situations where the training and test
distributions differ (covariate shift, e.g. Sugiyama et al. (2008)). Recent work has
demonstrated experimentally that implicit updates can significantly outperform first-order
updates both on importance weighted and standard learning problems (Karampatziakis and
Langford, 2010).



The following simple example demonstrates the intuition for these improvements. The
key is that the linearization of $f_t$ over-estimates the decrease in loss under $f_t$ achieved
by moving in the direction $-\nabla f_t(x_t)$. The farther $x_{t+1}$ is chosen from $x_t$, and
the more non-linear the $f_t$, the worse this approximation can be. Consider gradient descent in
one dimension with $f_t(x) = \tfrac{1}{2}(x - 3)^2$ and $x_t = 2$. Then $\nabla f_t(2) = -1$,
and if we choose a learning rate $\eta_t > 1$, we will actually overshoot the optimum for $f_t$
(such a learning rate could be indicated by the theory if the feasible set is large, for
example). Implicit updates, on the other hand, will never choose $x_{t+1} > 3$; rather,
$x_{t+1} \to 3$ as $\eta_t \to \infty$. Thus, we see implicit updates can
be significantly better behaved with large learning rates. Note that an importance weight
of $n$ is equivalent to multiplying the learning rate by $n$, so when importance weights can be
large, implicit updates can be particularly beneficial.
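The one-dimensional example can be checked numerically. The sketch below assumes the mirror-descent form of the two updates with a Euclidean penalty $\tfrac{1}{2\eta}(x - x_t)^2$; the function names are illustrative. The implicit step is obtained by setting the derivative $(x - 3) + (x - x_t)/\eta$ to zero.

```python
def explicit_step(x_t: float, eta: float, target: float = 3.0) -> float:
    """Gradient step on f(x) = 0.5 * (x - target)^2, linearized at x_t."""
    grad = x_t - target
    return x_t - eta * grad


def implicit_step(x_t: float, eta: float, target: float = 3.0) -> float:
    """argmin_x 0.5*(x - target)^2 + (1/(2*eta))*(x - x_t)^2, in closed form."""
    return (x_t + eta * target) / (1.0 + eta)


overshoot = explicit_step(2.0, eta=2.0)       # lands at 4.0, past the optimum 3
well_behaved = implicit_step(2.0, eta=2.0)    # lands at 8/3, short of the optimum
near_optimum = implicit_step(2.0, eta=1e6)    # approaches 3 from below
```

With $\eta_t = 2$ the linearized step jumps from 2 to 4, while the implicit step stays between $x_t$ and the minimizer for every positive learning rate.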

The overshooting issue is even more pronounced with non-smooth objectives, for
example, $f_t(x) = g \cdot x + \|x\|_1$. A standard gradient descent update will in general
never set $x_{t+1} = 0$ despite the $L_1$ regularization; handling the $L_1$ term via an
implicit update solves this problem. This is exactly the insight that COMID algorithms like
FOBOS exploit; by analyzing general implicit updates, we achieve an analysis of these algorithms
while also supporting a much larger class of updates.

When the functional form of the non-smooth component of the objective (for example
$\|x\|_1$) is fixed across rounds, it is preferable to perform an explicit optimization
involving the total non-smooth contribution $\alpha_{1:t} \Psi$ (RDA and FTPRL) rather than just
the round $t$ contribution (COMID). While RDA supports this type of non-smooth objective, it
requires the weight on $\Psi$ to be fixed across rounds. We generalize this to non-increasing
per-round contributions in this work.

Suppose one is performing online logistic regression, and believes a priori that the
coefficients have a Laplacian distribution. Then, $L_1$-penalized logistic regression
corresponds to MAP estimation (e.g., Lee et al. (2006)); suppose the prior corresponds to a
total penalty of $\lambda \|x\|_1$. If the size of the dataset $T$ is known in advance, then we
can use $\alpha_t = \lambda / T$, and by making multiple passes over the data, we will converge
to the MAP estimate. However, in the online setting we will in general not know $T$ in advance,
and we may wish to use an online algorithm for computational efficiency. In this case, any fixed
value of $\alpha_t$ will correspond to strengthening the prior each time we see a new example,
which is undesirable. With the generalized notion of composite updates introduced here, this
problem is overcome by choosing $\alpha_1 \Psi(x) = \lambda \|x\|_1$, and $\alpha_t = 0$ for
$t \ge 2$. Thus, the fixed penalty on the coefficients is correctly encoded, independent of $T$.
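The difference between the two schedules is easiest to see in the cumulative weight $\alpha_{1:t}$ that multiplies $\Psi$ in the update. The following sketch uses hypothetical values ($\lambda = 0.1$, five rounds) and an illustrative helper name.

```python
from itertools import accumulate
from typing import List


def cumulative_penalty(alphas: List[float]) -> List[float]:
    """Return alpha_{1:t} for t = 1..len(alphas)."""
    return list(accumulate(alphas))


lam, rounds = 0.1, 5
# Fixed per-round weight: the total penalty keeps growing with t.
fixed = cumulative_penalty([lam] * rounds)
# Prior-style schedule: alpha_1 = lam and alpha_t = 0 afterwards.
prior = cumulative_penalty([lam] + [0.0] * (rounds - 1))
```

Under the fixed schedule the effective prior strength after five rounds is five times $\lambda$, while the second schedule keeps it at $\lambda$ regardless of how many examples arrive. Both schedules are non-negative and non-increasing, as the analysis requires.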

2.3 Summary of Regret Bounds 

In Section 4, we analyze the update rule of Equation (3) when

\[ R_t(x) = \tfrac{1}{2} \bigl\| Q_t^{1/2} (x - y_t) \bigr\|^2, \tag{4} \]

where $\|\cdot\| = \|\cdot\|_2$ here and throughout. The points $y_t \in \mathbb{R}^n$ are the
centers for the additional regularization added on each round. Choosing $y_t = 0$ leads to an
analysis of RDA with implicit updates, and choosing $y_t = x_t$ yields the
follow-the-proximally-regularized-leader algorithm with implicit updates. Using $y_t = x_t$
together with a modified choice of $f_t$ leads to composite-objective mirror descent (see
Section 3).



The generalized learning rates $Q_t$ can be chosen adaptively using techniques from
McMahan and Streeter (2010) and Duchi et al. (2010a), which leads to improved regret bounds,
as well as algorithms that perform much better in practice (Streeter and McMahan, 2010).
Since in this work we provide suitable regret bounds in terms of arbitrary $Q_t$, the adaptive
techniques can be applied directly. Doing so complicates the exposition somewhat, and so
for simplicity and ease of comparison to previous results we state specific regret bounds for
scalar learning rates:

Corollary 1. Let $\Psi$ be the indicator function on a feasible set $\mathcal{F}$, and let
$D = \max_{a, b \in \mathcal{F}} \|a - b\|$. So that our bounds are comparable, suppose
$\max_{a \in \mathcal{F}} \|a\| = D/2$ (for example, if $\mathcal{F}$ is symmetric). Let $f_t$
be a sequence of convex loss functions such that $\|\nabla f_t(x)\| \le G$ for all $t$
and all $x \in \mathcal{F}$. Then for FTPRL we set $y_t = x_t$ and have

\[ \text{Regret} \le DG\sqrt{2T}. \]

Implicit-update mirror descent obtains the same bound. For regularized dual averaging we
choose $y_t = 0$ for all $t$, and obtain

\[ \text{Regret} \le \tfrac{1}{2} DG\sqrt{2T} + \frac{GD}{\sqrt{2}} \ln T + O(1). \]

These bounds are achieved with an adaptive learning rate that depends only on $t$ ($T$
need not be known in advance). If $T$ is known, then the $\sqrt{2}$ constant on the $\sqrt{T}$
terms can be eliminated. The regret bounds with per-coordinate adaptive rates are at least as
good, and often better. This corollary is a direct consequence of the following general result:

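Theorem 2. Let $\Psi$ be an extended convex function on $\mathbb{R}^n$ with $\Psi(x) \ge 0$ and
$0 \in \partial\Psi(0)$, let $f_t$ be a sequence of convex loss functions, and let
$\alpha_t \in \mathbb{R}$ be non-negative and non-increasing real numbers
($0 \le \alpha_{t+1} \le \alpha_t$). Consider the FTRL algorithm that plays $x_1 = 0$
and afterwards plays according to Equation (3),

\[ x_{t+1} = \arg\min_x \; g'_{1:t-1} \cdot x + f_t(x) + \alpha_{1:t} \Psi(x) + R_{1:t}(x), \]

using incremental quadratic regularization functions
$R_t(x) = \tfrac{1}{2} \| Q_t^{1/2} (x - y_t) \|^2$, where $Q_1 \in S^n_{++}$,
$Q_t \in S^n_{+}$ for $t > 1$, and $y_t \in \mathbb{R}^n$. Then there exist
$g'_t \in \mathbb{R}^n$ such that

\begin{align*}
\text{Regret}(\mathring{x}) &\le R_{1:T}(\mathring{x}) + \alpha_{1:T} \Psi(\mathring{x})
  + \sum_{t=1}^{T} \Bigl[ \bigl( g_t - \tfrac{1}{2} g'_t \bigr)^{\top} Q_{1:t}^{-1} g'_t
  - {g'_t}^{\top} Q_{1:t}^{-1} Q_t (y_t - x_t) \Bigr] \\
&\le R_{1:T}(\mathring{x}) + \alpha_{1:T} \Psi(\mathring{x})
  + \sum_{t=1}^{T} \Bigl[ \tfrac{1}{2} \bigl\| Q_{1:t}^{-1/2} g_t \bigr\|^2 - \delta_t
  - {g'_t}^{\top} Q_{1:t}^{-1} Q_t (y_t - x_t) \Bigr]
\end{align*}

versus any point $\mathring{x} \in \mathbb{R}^n$, for any $g_t \in \partial f_t(x_t)$, with
$\delta_t \ge 0$.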

We will show that $g'_t$ is a certain subgradient of $f_t$, and in fact when all
$\alpha_t = 0$, then $g'_t \in \partial f_t(x_{t+1})$. If $f_t$ is strictly convex, then in
general $g_t \ne g'_t$, and so the inequality between the first and second bounds can be strict;
in fact, we will show that on rounds $t$ where the implicit update is non-trivial,
$\delta_t > 0$, indicating a one-step advantage for implicit updates. When all $\alpha_t = 0$,
$\delta_t$ is one-half the improvement in the objective function of Equation (3)
obtained by solving for the optimum point rather than using a solution from the linearized
problem; the proof of Lemma 11 makes this precise.
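A short algebraic check, offered as a sketch rather than as the paper's argument, shows why the leading summands of the two bounds in Theorem 2 are ordered. Write $Q = Q_{1:t}$ (positive definite, since $Q_1$ is), $g = g_t$, and $g' = g'_t$; these abbreviations are introduced only for this illustration. Completing the square gives the identity below. Whether the nonnegative gap that appears coincides with the paper's $\delta_t$ is not established here; that is the role of Lemma 11.

```latex
\begin{align*}
\left(g - \tfrac{1}{2} g'\right)^{\top} Q^{-1} g'
  &= \tfrac{1}{2}\, g^{\top} Q^{-1} g
     - \tfrac{1}{2}\,(g - g')^{\top} Q^{-1} (g - g') \\
  &= \tfrac{1}{2} \bigl\| Q^{-1/2} g \bigr\|^2
     - \tfrac{1}{2} \bigl\| Q^{-1/2} (g - g') \bigr\|^2
  \;\le\; \tfrac{1}{2} \bigl\| Q^{-1/2} g \bigr\|^2 .
\end{align*}
```

The inequality is strict exactly when $g \ne g'$, which is consistent with the remark that a non-trivial implicit update yields a strict one-step improvement.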

For RDA, we take $y_t = 0$, and for FTPRL and implicit-update mirror descent we take
$y_t = x_t$. Since no restrictions are placed on the $y_t$ in the theorem, the final right-hand
term being subtracted could be positive, negative, or zero.

If we treat $\alpha_t \Psi$ as an intrinsic part of the problem, that is, we are measuring loss
against $f_t(x) + \alpha_t \Psi(x)$, then the $\alpha_{1:T} \Psi(\mathring{x})$ term disappears
from the regret bound.

2.4 Notation and Technical Background 

We use the notation $g_{1:t}$ as a shorthand for $\sum_{s=1}^{t} g_s$. Similarly we write
$Q_{1:t}$ for a sum of matrices $Q_s$, and we use $f_{1:t}$ to denote the function
$f_{1:t}(x) = \sum_{s=1}^{t} f_s(x)$. We assume the summation binds more tightly than exponents,
so $Q_{1:t}^{1/2} = (Q_{1:t})^{1/2}$. We write $x^{\top} y$ or $x \cdot y$
for the inner product between $x, y \in \mathbb{R}^n$. We write "the functions $f_t$" for the
sequence of functions $(f_1, \ldots, f_t)$.

We write $S^n_{+}$ for the set of symmetric positive semidefinite $n \times n$ matrices, with
$S^n_{++}$ the corresponding set of symmetric positive definite matrices. Recall
$A \in S^n_{++}$ means $\forall x \ne 0,\ x^{\top} A x > 0$. Since $A \in S^n_{+}$ is symmetric,
$x^{\top} A y = y^{\top} A x$ (we often use this result implicitly). For $B \in S^n_{+}$, we
write $B^{1/2}$ for the square root of $B$, the unique $X \in S^n_{+}$ such
that $XX = B$ (see, for example, Boyd and Vandenberghe (2004, A.5.2)).



Unless otherwise stated, convex functions are assumed to be extended, with domain
$\mathbb{R}^n$ and range $\mathbb{R} \cup \{\infty\}$ (see, for example, (Boyd and Vandenberghe,
2004, 3.1.2)). For a convex function $f$, we let $\partial f(x)$ denote the set of subgradients
of $f$ at $x$ (the subdifferential of $f$ at $x$). By definition, $g \in \partial f(x)$ means
$f(y) \ge f(x) + g^{\top}(y - x)$ for all $y$. When $f$ is differentiable, we write
$\nabla f(x)$ for the gradient of $f$ at $x$. In this case, $\partial f(x) = \{\nabla f(x)\}$.
All mins and argmins are over $\mathbb{R}^n$ unless otherwise noted. We make frequent use of
the following standard results:

Theorem 3. Let $R : \mathbb{R}^n \to \mathbb{R}$ be strongly convex with continuous first
partial derivatives, and let $\Psi$ and $f$ be arbitrary (extended) convex functions. Then,

A. Let $U(x) = R(x) + \Psi(x)$. Then, there exists a unique pair $(x^*, \phi^*)$ such that both

\[ \phi^* \in \partial\Psi(x^*) \quad \text{and} \quad x^* = \arg\min_x \; R(x) + \phi^* \cdot x. \]

Further, this $x^*$ is the unique minimizer of $U$, and $\nabla R(x^*) + \phi^* = 0$.

B. Let $V(x) = R(x) + \Psi(x) + f(x)$ and $\hat{x} = \arg\min_x V(x)$. Then, there exists a
$g \in \partial f(\hat{x})$ such that

\[ \hat{x} = \arg\min_x \; R(x) + \Psi(x) + g \cdot x. \]

Proof. First we consider part A. Since $R$ is strongly convex, $U$ is strongly convex, and
so has a unique minimizer $x^*$ (see for example, (Boyd and Vandenberghe, 2004, 9.1.2)).
Let $r = \nabla R$. Since $x^*$ is a minimizer of $U$, there must exist a
$\phi^* \in \partial\Psi(x^*)$ such that $r(x^*) + \phi^* = 0$, as this is a necessary (and
sufficient) condition for $0 \in \partial U(x^*)$. It follows that
$x^* = \arg\min_x R(x) + \phi^* \cdot x$, as $r(x^*) + \phi^*$ is the gradient of this objective
at $x^*$. Suppose some other $(x', \phi')$ satisfies the conditions of the theorem. Then,
$r(x') + \phi' = 0$, and so $0 \in \partial U(x')$, and so $x'$ is a minimizer of $U$. Since
this minimizer is unique, $x' = x^*$, and $\phi' = -r(x^*) = \phi^*$. An equivalent condition to
$x^* = \arg\min_x R(x) + \phi^* \cdot x$ is $\nabla R(x^*) + \phi^* = 0$.

For part B, by definition of optimality, there exists a $\phi \in \partial\Psi(\hat{x})$ and a
$g \in \partial f(\hat{x})$ such that $g + \phi + \nabla R(\hat{x}) = 0$. Choosing this $g$,
define

\[ \tilde{x} = \arg\min_x \; R(x) + \Psi(x) + g \cdot x. \]

Applying part A with $R(x) \leftarrow R(x) + g \cdot x$, there exists a unique pair
$(\tilde{x}, \tilde{\phi})$ such that $\tilde{\phi} \in \partial\Psi(\tilde{x})$ and
$\nabla R(\tilde{x}) + \tilde{\phi} + g = 0$. Since $(\hat{x}, \phi)$ satisfy this equation, we
conclude $\hat{x} = \tilde{x}$. □

Feasible Sets In some applications, we may be restricted to only play points from a
convex feasible set $\mathcal{F} \subseteq \mathbb{R}^n$, for example, the set of (fractional)
paths between two nodes in a graph. A feasible set is also necessary to prove regret bounds
against linear functions. With composite updates, Equations (2) and (3), this is accomplished
for free by choosing $\Psi$ to be the indicator function $I_{\mathcal{F}}$ on $\mathcal{F}$,
where $I_{\mathcal{F}}(x) = 0$ for $x \in \mathcal{F}$ and $\infty$ otherwise. It is
straightforward to verify that

\[ \arg\min_{x \in \mathbb{R}^n} \; g_{1:t} \cdot x + R_{1:t}(x) + I_{\mathcal{F}}(x)
   = \arg\min_{x \in \mathcal{F}} \; g_{1:t} \cdot x + R_{1:t}(x), \]

and so in this work we can generalize (for example) the results of (McMahan and Streeter,
2010) for specific feasible sets without specifically discussing $\mathcal{F}$, and instead
considering arbitrary extended convex functions $\Psi$. Note that in this case the choice of
$\alpha_t$ does not matter as long as $\alpha_1 > 0$.
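As a small illustration of the indicator-function identity, consider the one-dimensional case with $\mathcal{F} = [lo, hi]$ and quadratic regularization $R_{1:t}(x) = \tfrac{\sigma_{1:t}}{2} x^2$. These choices, and the function name, are assumptions made only for this sketch. Because the objective is a convex parabola, the constrained minimizer is the unconstrained one clipped to the interval.

```python
def ftrl_on_interval(g_sum: float, sigma_sum: float, lo: float, hi: float) -> float:
    """argmin over x in [lo, hi] of g_sum*x + (sigma_sum/2)*x^2.

    Equivalent to the unconstrained FTRL objective plus the indicator of [lo, hi].
    """
    unconstrained = -g_sum / sigma_sum
    return min(max(unconstrained, lo), hi)


def objective(x: float, g_sum: float, sigma_sum: float) -> float:
    """The FTRL objective without the indicator term."""
    return g_sum * x + 0.5 * sigma_sum * x * x
```

For example, with `g_sum = 4.0` and `sigma_sum = 2.0` the unconstrained minimizer is -2.0, so on the interval [-1, 1] the update returns the boundary point -1.0. In higher dimensions with a general $Q_{1:t}$ the constrained problem is a projection in the norm induced by $Q_{1:t}$, which this sketch does not cover.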



3 Mirror Descent Follows The Leader 

In this section we consider the relationship between mirror descent algorithms (the simplest
example being online gradient descent) and FTRL algorithms. Let
$f_t(x) = g_t \cdot x + \Psi(x)$. Let $R_1$ be strongly convex, with all the $R_t$ convex. We
assume that $\min_x R_1(x) = 0$, and assume that $x = 0$ is the unique minimizer unless
otherwise noted.



Follow The Regularized Leader (FTRL) The simplest follow-the-regularized-leader
algorithm plays

\[ x_{t+1} = \arg\min_x \; g_{1:t} \cdot x + \frac{\sigma_{1:t}}{2} \|x\|_2^2, \tag{5} \]

where $\sigma_{1:t} \in \mathbb{R}$ is the amount of stabilizing strong convexity added.
A more general update is

\[ x_{t+1} = \arg\min_x \; g_{1:t} \cdot x + R_{1:t}(x), \]

where we add an additional convex function $R_t$ on each round. When
$\arg\min_{x \in \mathbb{R}^n} R_t(x) = 0$, we call the functions $R_t$ (and associated
algorithms) origin-centered. We can also define proximal versions of FTRL¹ that center
additional regularization at the current point rather than at the origin. In this section, we
write $\tilde{R}_t(x) = R_t(x - x_t)$ and reserve the $R_t$ notation for origin-centered
functions. Note that $\tilde{R}_t$ is only needed to select $x_{t+1}$, and $x_t$ is known
to the algorithm at this point, ensuring the algorithm only needs access to the first $t$ loss
functions when computing $x_{t+1}$ (as required).



Mirror Descent The simplest version of mirror descent is gradient descent using a
constant step size $\eta$, which plays

\[ x_{t+1} = x_t - \eta g_t = -\eta g_{1:t}. \tag{6} \]

In order to get low regret, $T$ must be known in advance so $\eta$ can be chosen accordingly
(or a doubling trick can be used). But, since there is a closed-form solution for the point
$x_{t+1}$ in terms of $g_{1:t}$ and $\eta$, we generalize this to a "revisionist" algorithm that
on each round plays the point that gradient descent with constant step size would have played if
it had used step size $\eta_t$ on rounds 1 through $t - 1$. That is,
$x_{t+1} = -\eta_t g_{1:t}$. When $R_t(x) = \tfrac{\sigma_t}{2} \|x\|_2^2$
and $\eta_t = \tfrac{1}{\sigma_{1:t}}$, this is equivalent to the FTRL of Equation (5).
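The equivalence between the "revisionist" step and the FTRL update of Equation (5) can be checked in one dimension. In this sketch the function names and the sample gradients are hypothetical; the FTRL minimizer is obtained by setting the derivative $g_{1:t} + \sigma_{1:t} x$ to zero.

```python
from typing import Sequence


def gradient_descent_constant(grads: Sequence[float], eta: float) -> float:
    """Run x <- x - eta * g from x_1 = 0 over all gradients; returns x_{t+1}."""
    x = 0.0
    for g in grads:
        x -= eta * g
    return x


def ftrl_quadratic(grads: Sequence[float], sigmas: Sequence[float]) -> float:
    """argmin_x g_{1:t}*x + (sigma_{1:t}/2)*x^2 = -g_{1:t} / sigma_{1:t}."""
    return -sum(grads) / sum(sigmas)


grads = [1.0, -0.5, 2.0]
sigmas = [2.0, 1.0, 1.0]
eta_t = 1.0 / sum(sigmas)  # eta_t = 1 / sigma_{1:t}
revisionist = gradient_descent_constant(grads, eta_t)
ftrl = ftrl_quadratic(grads, sigmas)
```

Both computations land on the same point, $-g_{1:t} / \sigma_{1:t}$, because replaying gradient descent from the origin with a single step size simply scales the gradient sum.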

¹ We adapt the name "proximal" from Do et al. (2009), but note that while similar proximal
regularization functions were considered, that paper deals only with gradient descent
algorithms, not FTRL.
In general, we will be more interested in gradient descent algorithms which use an
adaptive step size that depends (at least) on the round $t$. Using a variable step size
$\eta_t$ on each round, gradient descent plays:

\[ x_{t+1} = x_t - \eta_t g_t. \tag{7} \]

An intuition for this update comes from the fact it can be re-written as

\[ x_{t+1} = \arg\min_x \; g_t \cdot x + \frac{1}{2\eta_t} \|x - x_t\|_2^2. \]

This version captures the notion (in online learning terms) that we don't want to change
our hypothesis $x_t$ too much (for fear of predicting badly on examples we have already seen),
but we do want to move in a direction that decreases the loss of our hypothesis on the most
recently seen example. Here, this is approximated by the linear function $g_t$, but implicit
updates use the exact loss $f_t$.

Mirror descent algorithms use this intuition, replacing the $L_2$-squared penalty with an
arbitrary Bregman divergence. For a differentiable, strictly convex $R$, the corresponding
Bregman divergence is

$$B_R(x, y) = R(x) - \big(R(y) + \nabla R(y) \cdot (x - y)\big)$$

for any $x, y \in \mathbb{R}^n$. We then have the update

$$x_{t+1} = \arg\min_x \; g_t \cdot x + \frac{1}{\eta_t} B_R(x, x_t), \tag{8}$$

or explicitly (by setting the gradient of (8) to zero),

$$x_{t+1} = r^{-1}\big(r(x_t) - \eta_t g_t\big) \tag{9}$$

where $r = \nabla R$. Letting $R(x) = \frac{1}{2}\|x\|_2^2$ so that $B_R(x, x_t) = \frac{1}{2}\|x - x_t\|_2^2$ recovers the algorithm
of Equation (7). One way to see this is to note that $r(x) = r^{-1}(x) = x$ in this case.
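The explicit form (9) can be illustrated with a small sketch. The quadratic case reproduces the gradient step of Equation (7); the negative-entropy regularizer $R(x) = x \log x - x$ on $x > 0$ is an illustrative choice that does not appear in the text, included only to show a non-quadratic instance. All function names and numbers below are assumptions made for the example, and each closed-form step is compared against a numerical minimization of (8).

```python
import math


def bregman(R, r, x, y):
    """B_R(x, y) = R(x) - (R(y) + r(y) * (x - y)) for scalar x, y, with r = R'."""
    return R(x) - (R(y) + r(y) * (x - y))


def mirror_step(x_t, g_t, eta_t, r, r_inv):
    """Explicit mirror descent step x_{t+1} = r^{-1}(r(x_t) - eta_t * g_t)."""
    return r_inv(r(x_t) - eta_t * g_t)


def argmin_1d(objective, lo, hi, iters=200):
    """Ternary search for the minimizer of a convex 1-D function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) <= objective(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


x_t, g_t, eta_t = 2.0, 0.7, 0.1  # hypothetical values

# Quadratic R(x) = x**2 / 2: r is the identity, so the step is plain gradient descent.
def square(x):
    return 0.5 * x * x

def identity(x):
    return x

x_quad = mirror_step(x_t, g_t, eta_t, identity, identity)
x_quad_check = argmin_1d(
    lambda x: g_t * x + bregman(square, identity, x, x_t) / eta_t, -10.0, 10.0)

# Negative entropy (illustrative assumption): r = log, r^{-1} = exp.
def neg_entropy(x):
    return x * math.log(x) - x

x_ent = mirror_step(x_t, g_t, eta_t, math.log, math.exp)
x_ent_check = argmin_1d(
    lambda x: g_t * x + bregman(neg_entropy, math.log, x, x_t) / eta_t, 1e-6, 10.0)

print(x_quad, x_quad_check)
print(x_ent, x_ent_check)
```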

We can generalize this even further by adding a new strongly convex function $R_t$ to the
Bregman divergence on each round. Namely, let

$$B_{1:t}(x, y) = \sum_{s=1}^{t} B_{R_s}(x, y),$$

so the update becomes

$$x_{t+1} = \arg\min_x \; g_t \cdot x + B_{1:t}(x, x_t) \tag{10}$$

or equivalently $x_{t+1} = (r_{1:t})^{-1}\big(r_{1:t}(x_t) - g_t\big)$ where $r_{1:t} = \sum_{s=1}^{t} \nabla R_s = \nabla R_{1:t}$ and $(r_{1:t})^{-1}$ is
the inverse of $r_{1:t}$. The step size $\eta_t$ is now encoded implicitly in the choice of $R_t$.

Composite-objective mirror descent (COMID) (Duchi et al. 2010b) handles $\Psi$ functions²
as part of the objective on each round: $f_t(x) = g_t \cdot x + \Psi(x)$. Using our notation, the COMID
update is

$$x_{t+1} = \arg\min_x \; \eta g_t \cdot x + B(x, x_t) + \eta \Psi(x),$$

which can be generalized to

$$x_{t+1} = \arg\min_x \; g_t \cdot x + \Psi(x) + B_{1:t}(x, x_t), \tag{11}$$

where the learning rate $\eta$ has been rolled into the definition of $R_1, \dots, R_t$. When $\Psi$ is chosen
to be the indicator function on a convex set, COMID reduces to standard mirror descent
with greedy projection.

2 Our $\Psi$ is denoted $r$ in (Duchi et al. 2010b).
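For the choice $\Psi(x) = \lambda\|x\|_1$ with a quadratic Bregman divergence, update (11) separates across coordinates and each coordinate has a soft-thresholding solution. The sketch below works in one dimension and assumes $B_{1:t}(x, x_t) = \frac{\sigma}{2}(x - x_t)^2$, where `sigma` stands for $\sigma_{1:t}$; the function names and numeric values are illustrative. The closed form is compared against a numerical minimization of the objective.

```python
def soft_threshold(u, tau):
    """Shrink u toward zero by tau, returning exactly 0.0 when |u| <= tau."""
    if u > tau:
        return u - tau
    if u < -tau:
        return u + tau
    return 0.0


def comid_l1_step(x_t, g_t, sigma, lam):
    """argmin_x g_t * x + lam * |x| + (sigma / 2) * (x - x_t)**2 in closed form."""
    return soft_threshold(x_t - g_t / sigma, lam / sigma)


def argmin_1d(objective, lo=-100.0, hi=100.0, iters=200):
    """Ternary search for the minimizer of a convex 1-D function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) <= objective(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


def comid_l1_step_numeric(x_t, g_t, sigma, lam):
    """The same update, solved by direct numerical minimization."""
    return argmin_1d(
        lambda x: g_t * x + lam * abs(x) + 0.5 * sigma * (x - x_t) ** 2)


# Hypothetical inputs: the first step lands exactly on zero, the second does not.
step_zero = comid_l1_step(0.5, 1.0, 2.0, 0.3)
step_nonzero = comid_l1_step(0.5, -1.0, 2.0, 0.3)
print(step_zero, step_nonzero)
```

The exact zero returned by the first call is what produces sparsity: any coordinate whose unregularized step falls within $\lambda/\sigma$ of the origin is set to zero rather than to a small non-zero value.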



3.1 An Equivalence Theorem for Proximal Regularization 

The following theorem shows that mirror descent algorithms can be viewed as FTRL algo- 
rithms: 

Theorem 4. Let $R_t$ be a sequence of differentiable origin-centered convex functions ($\nabla R_t(0) = 0$),
with $R_1$ strongly convex, and let $\Psi$ be an arbitrary convex function. Let $x_1 = \hat{x}_1 = 0$. For
a sequence of loss functions $f_t(x) + \Psi(x)$, let the sequence of points played by the implicit-update
composite-objective mirror descent algorithm be

$$\hat{x}_{t+1} = \arg\min_x \; f_t(x) + \alpha_t \Psi(x) + \tilde{B}_{1:t}(x, \hat{x}_t), \tag{12}$$

where $\tilde{R}_t(x) = R_t(x - \hat{x}_t)$, and $\tilde{B}_t = B_{\tilde{R}_t}$, so $\tilde{B}_{1:t}$ is the Bregman divergence with respect to
$\tilde{R}_1 + \cdots + \tilde{R}_t$. Consider the alternative sequence of points $x_t$ played by a proximal FTRL
algorithm, applied to these same $f_t$, defined by

$$x_{t+1} = \arg\min_x \; (g'_{1:t-1} + \phi_{1:t-1}) \cdot x + f_t(x) + \alpha_t \Psi(x) + \tilde{R}_{1:t}(x) \tag{13}$$

for some $g'_t \in \partial f_t(x_{t+1})$ and $\phi_t \in \partial(\alpha_t \Psi)(x_{t+1})$. Then, these algorithms are equivalent, in
that $x_t = \hat{x}_t$ for all $t > 0$.

We defer the proof to the end of this section. The Bregman divergences used by mirror
descent in the theorem are with respect to the proximal functions $\tilde{R}_{1:t}$, whereas typically (as
in Equation (10)) these functions would not depend on the previous points played. We will
show when $R_t(x) = \frac{1}{2}\|Q_t^{1/2} x\|_2^2$, this issue disappears. Considering arbitrary $\Psi$ functions and
implicit updates also complicates the theorem statement somewhat. The following corollary
sidesteps these complexities, to state a simple direct equivalence result:

Corollary 5. Let $f_t(x) = g_t \cdot x$. Then, the following algorithms play identical points:

• Gradient descent with positive semi-definite learning rates $Q_t$, defined by:

$$x_{t+1} = x_t - Q_{1:t}^{-1} g_t.$$

• FTRL-Proximal with regularization functions $\tilde{R}_t(x) = \frac{1}{2}\|Q_t^{1/2}(x - x_t)\|_2^2$, which plays

$$x_{t+1} = \arg\min_x \; g_{1:t} \cdot x + \tilde{R}_{1:t}(x).$$
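A one-dimensional numerical check of this corollary is sketched below, assuming scalar learning-rate terms $Q_t > 0$; the names `adaptive_gd` and `ftrl_proximal` and the sample data are hypothetical. The FTRL-Proximal iterate is obtained by minimizing the stated objective directly, using the regularization centers it has itself played so far.

```python
def argmin_1d(objective, lo=-100.0, hi=100.0, iters=200):
    """Ternary search for the minimizer of a convex 1-D function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) <= objective(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


def adaptive_gd(grads, qs):
    """x_{t+1} = x_t - g_t / Q_{1:t}, starting from x_1 = 0 (scalar Q_t)."""
    x, q_sum, points = 0.0, 0.0, [0.0]
    for g, q in zip(grads, qs):
        q_sum += q
        x = x - g / q_sum
        points.append(x)
    return points


def ftrl_proximal(grads, qs):
    """x_{t+1} = argmin_x g_{1:t} * x + sum_{s<=t} (Q_s / 2) * (x - x_s)**2."""
    points = [0.0]  # points[s] holds x_{s+1}
    for t in range(1, len(grads) + 1):
        def objective(x, t=t):
            linear = sum(grads[:t]) * x
            proximal = sum(0.5 * qs[s] * (x - points[s]) ** 2 for s in range(t))
            return linear + proximal
        points.append(argmin_1d(objective))
    return points


# Hypothetical gradients and per-round regularization weights.
grads = [0.8, -1.5, 0.3, 2.0, -0.6]
qs = [1.0, 0.4, 0.3, 0.25, 0.2]
gap = max(abs(a - b) for a, b in zip(adaptive_gd(grads, qs),
                                     ftrl_proximal(grads, qs)))
print(gap)  # small, up to the accuracy of the numerical search
```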

X 

Proof. Let $R_t(x) = \frac{1}{2} x^\top Q_t x$. It is easy to show that $R_{1:t}$ and $\tilde{R}_{1:t}$ differ by only a linear
function, and so (by a standard result) $B_{1:t}$ and $\tilde{B}_{1:t}$ are equal, and simple algebra reveals

$$B_{1:t}(x, y) = \tilde{B}_{1:t}(x, y) = \frac{1}{2}\|Q_{1:t}^{1/2}(x - y)\|_2^2.$$

Then, it follows from the mirror descent update with divergence $B_{1:t}$ (Equation (10)) that the first
algorithm is a mirror descent algorithm using this Bregman divergence. Taking $\Psi(x) = 0$ and hence
$\phi_t = 0$, the result follows from Theorem 4. □

Extending the approach of the corollary to FOBOS, we see the only difference between
that algorithm and FTRL-Proximal is that FTRL-Proximal optimizes over $t\Psi(x)$, whereas
in Equation (13) we optimize over $\phi_{1:t-1} \cdot x + \Psi(x)$ (see Table 1). Thus, FOBOS is equivalent
to FTRL-Proximal, except that FOBOS approximates all but the most recent $\Psi$ function
by a subgradient.
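The practical effect of this difference can be seen in a small one-dimensional sketch with $\Psi(x) = \lambda|x|$. The setup is an illustrative assumption rather than anything taken from the experiments: $\sigma_1 = 1$ and $\sigma_t = 0$ afterwards (a constant learning rate of 1), $\lambda = 0.5$, and gradients alternating between $0.8$ and $-0.7$. FOBOS applies the current round's penalty $\lambda$ at each step, while FTRL-Proximal applies the cumulative penalty $t\lambda$ to the accumulated gradient.

```python
def soft_threshold(u, tau):
    """Shrink u toward zero by tau, returning exactly 0.0 when |u| <= tau."""
    if u > tau:
        return u - tau
    if u < -tau:
        return u + tau
    return 0.0


def fobos_l1(grads, sigmas, lam):
    """x_{t+1} = argmin_x g_t*x + lam*|x| + (sigma_{1:t}/2)*(x - x_t)**2."""
    points, sigma_sum = [0.0], 0.0
    for g, sigma in zip(grads, sigmas):
        sigma_sum += sigma
        points.append(soft_threshold(points[-1] - g / sigma_sum, lam / sigma_sum))
    return points


def ftrl_proximal_l1(grads, sigmas, lam):
    """x_{t+1} = argmin_x g_{1:t}*x + t*lam*|x| + sum_s (sigma_s/2)*(x - x_s)**2."""
    points, g_sum, sigma_sum, center_sum = [0.0], 0.0, 0.0, 0.0
    for t, (g, sigma) in enumerate(zip(grads, sigmas), start=1):
        g_sum += g
        sigma_sum += sigma
        center_sum += sigma * points[-1]  # sigma_t * x_t
        z = g_sum - center_sum            # linear coefficient of the objective
        points.append(-soft_threshold(z, t * lam) / sigma_sum)
    return points


# Hypothetical data: noisy alternating gradients, constant learning rate 1.
grads = [0.8, -0.7] * 5
sigmas = [1.0] + [0.0] * 9
lam = 0.5

fobos_points = fobos_l1(grads, sigmas, lam)
ftrl_points = ftrl_proximal_l1(grads, sigmas, lam)
fobos_nonzero = sum(1 for x in fobos_points[1:] if x != 0.0)
ftrl_nonzero = sum(1 for x in ftrl_points[1:] if x != 0.0)
print(fobos_nonzero, ftrl_nonzero)
```

In this toy sequence both algorithms agree on the first step, but FOBOS keeps being pushed off zero by each large gradient, whereas the cumulative threshold $t\lambda$ in FTRL-Proximal grows faster than the accumulated gradient and the iterate stays at zero.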

The behavior of FTRL-Proximal can thus be different from COMID when a non-trivial
$\Psi$ is used. While we are most concerned with the choice $\Psi(x) = \lambda\|x\|_1$, it is also worth
considering what happens when $\Psi$ is the indicator function on a feasible set $\mathcal{F}$. Then,
Theorem 4 shows that mirror descent on $f_t(x) = g_t \cdot x + \Psi(x)$ (equivalent to COMID in
this case) approximates previously seen $\Psi$s by their subgradients, whereas FTRL-Proximal
optimizes over $\Psi$ explicitly. In this case, it can be shown that the mirror-descent update
corresponds to the standard greedy projection (Zinkevich 2003), whereas FTRL-Proximal
corresponds to a lazy projection (McMahan and Streeter 2010).³



For the analysis in Section 4, we will use this special case for quadratic regularization:

Corollary 6. Consider Implicit-Update Composite-Objective Mirror Descent, which plays

$$x_{t+1} = \arg\min_x \; f_t(x) + \alpha_t \Psi(x) + \frac{1}{2}\|Q_{1:t}^{1/2}(x - x_t)\|^2. \tag{14}$$

Then an equivalent FTPRL update is

$$x_{t+1} = \arg\min_x \; (g'_{1:t-1} + \phi_{1:t-1}) \cdot x + f_t(x) + \alpha_t \Psi(x) + \frac{1}{2}\sum_{s=1}^{t} \|Q_s^{1/2}(x - x_s)\|^2 \tag{15}$$

for some $g'_t \in \partial f_t(x_{t+1})$ and $\phi_t \in \partial(\alpha_t \Psi)(x_{t+1})$.



Again let $f^w_t$ be the loss functions provided by the world, and let $f^u_t$ be the functions
defining the update and used in the above corollary. Then, we encode implicit mirror descent
by taking $f^u_t(x) \leftarrow f^w_t(x) + \alpha_t \Psi(x)$. We recover standard (non-implicit) COMID by taking
$f^u_t(x) \leftarrow \nabla f^w_t(x_t) \cdot x + \alpha_t \Psi(x)$. Applying this result leads to the expression for COMID in
Table 1.

Note that in both cases, the $\Psi$ listed separately in the update of Corollary 6 is taken to be zero; the $\Psi$
specified in the problem only enters into the update through the $f^u_t$. That is, we don't
actually need the machinery developed in this work for composite updates, rather we get an
analysis of mirror-descent style composite updates via our analysis of implicit updates. The
machinery for explicitly handling the full $\alpha_{1:t}\Psi$ penalty should be used in practice, however
(see Section 2.2). Note also that the standard COMID algorithm can thus be viewed as a
half-implicit algorithm: it uses an implicit update with respect to the $\Psi$ term, but applies
an immediate subgradient approximation to $f^w_t$.

We conclude the section with the proof of the main equivalence result. 
Proof of Theorem 4. For simplicity we consider the case where $f_t$ is differentiable.⁴ By
applying Theorem 3 to Eq. (13) (taking $\Phi$ to be all the terms other than the cumulative
regularization), there exists a $\phi_t \in \partial(\alpha_t \Psi)(x_{t+1})$ such that $g'_t = \nabla f_t(x_{t+1})$ and

$$g'_{1:t} + \phi_{1:t} + \nabla \tilde{R}_{1:t}(x_{t+1}) = 0. \tag{16}$$

Similarly, applying Theorem 3 to Eq. (12) implies there exists a $\hat{\phi}_t \in \partial(\alpha_t \Psi)(\hat{x}_{t+1})$ such
that $\hat{g}'_t = \nabla f_t(\hat{x}_{t+1})$ and

$$\hat{g}'_t + \hat{\phi}_t + \nabla \tilde{R}_{1:t}(\hat{x}_{t+1}) - \nabla \tilde{R}_{1:t}(\hat{x}_t) = 0, \tag{17}$$

recalling that $\nabla_u B_R(u, v) = \nabla R(u) - \nabla R(v)$.

We now proceed by induction on $t$, with the induction hypothesis that $x_t = \hat{x}_t$. The base
case $t = 1$ follows from the assumption that $x_1 = \hat{x}_1 = 0$. Suppose the induction hypothesis
holds for $t$. Taking Eq. (16) for $t - 1$ gives $g'_{1:t-1} + \phi_{1:t-1} + \nabla \tilde{R}_{1:t-1}(x_t) = 0$, and since
$\nabla \tilde{R}_t(x_t) = 0$, we have

$$-\nabla \tilde{R}_{1:t}(x_t) = g'_{1:t-1} + \phi_{1:t-1}. \tag{18}$$

Beginning from Eq. (17),

$$\begin{aligned}
0 &= \hat{g}'_t + \hat{\phi}_t + \nabla \tilde{R}_{1:t}(\hat{x}_{t+1}) - \nabla \tilde{R}_{1:t}(\hat{x}_t) \\
  &= \hat{g}'_t + \hat{\phi}_t + \nabla \tilde{R}_{1:t}(\hat{x}_{t+1}) - \nabla \tilde{R}_{1:t}(x_t) && \text{by the I.H.} \\
  &= g'_{1:t-1} + \hat{g}'_t + \phi_{1:t-1} + \hat{\phi}_t + \nabla \tilde{R}_{1:t}(\hat{x}_{t+1}),
\end{aligned} \tag{19}$$

where the last line uses Eq. (18). The proof follows by applying Lemma 3 to Eqs. (16) and
(19), and considering the pairs $(g'_t + \phi_t, x_{t+1})$ and $(\hat{g}'_t + \hat{\phi}_t, \hat{x}_{t+1})$. The equality $\phi_t = \hat{\phi}_t$
follows from the fact that $g'_t = \hat{g}'_t$ since $f_t$ is differentiable. ■

3 Zinkevich (2004, Sec. 5.2.3) describes a different lazy projection algorithm, which requires an appropriately
chosen constant step-size to get low regret. FTRL-Proximal does not suffer from this problem, because
it always centers the additional regularization $R_t$ at points in $\mathcal{F}$, whereas our results show the algorithm of
Zinkevich centers the additional regularization outside of $\mathcal{F}$, at the optimum of the unconstrained optimization.
This leads to the high regret in the case of standard adaptive step sizes, because the algorithm can get
"stuck" too far outside the feasible set to make it back to the other side.

4 This ensures both $g'_t$ and $\phi_t$ are uniquely determined; the proof still holds for general convex $f_t$, but
only the sum $g'_t + \phi_t$ will be uniquely determined.



3.2 An Equivalence Theorem for Origin-Centered Regularization 



For the moment, suppose $\Psi(x) = 0$. So far, we have shown conditions under which gradient
descent on $f_t(x) = g_t \cdot x$ with an adaptive step size is equivalent to follow-the-proximally-regularized-leader.
In this section, we show that mirror descent on the regularized functions
$f^R_t(x) = g_t \cdot x + R_t(x)$, with a certain natural step-size, is equivalent to a
follow-the-regularized-leader algorithm with origin-centered regularization. For simplicity, in this
section we restrict our attention to linear $f_t$ (equivalently, non-implicit updates). The
extension to implicit updates is straightforward.

The algorithm schema we consider next was introduced by Bartlett et al. (2007, Theorem
2.1). Letting $R_t(x) = \frac{\sigma_t}{2}\|x\|_2^2$ and fixing $\eta_t = \frac{1}{\sigma_{1:t}}$, their adaptive online gradient descent
algorithm is

$$x_{t+1} = x_t - \eta_t \nabla f^R_t(x_t) = x_t - \eta_t (g_t + \sigma_t x_t).$$

We show (in Corollary 8) that this algorithm is identical to follow-the-leader on the functions
$f^R_t(x) = g_t \cdot x + R_t(x)$, an algorithm that is minimax optimal in terms of regret against
quadratic functions like $f^R$ (Abernethy et al. 2008). As with the previous theorem, the
difference between the two is how they handle an arbitrary $\Psi$. If one uses $\tilde{R}_t(x) = \frac{\sigma_t}{2}\|x - x_t\|_2^2$
in place of $R_t(x)$, this algorithm reduces to standard online gradient descent (Do et al. 2009).

The key observation of Bartlett et al. (2007) is that if the underlying functions $f_t$ have
strong convexity, we can roll that into the $R_t$ functions, and so introduce less additional
stabilizing regularization, leading to regret bounds that interpolate between $\sqrt{T}$ for linear
functions and $\log T$ for strongly convex functions. Their work did not consider composite
objectives ($\Psi$ terms), but our equivalence theorems show their adaptivity techniques can be
lifted to algorithms like RDA and FTRL-Proximal that handle such non-smooth functions
more effectively than mirror descent formulations.

We will prove our equivalence theorem for a generalized version of the algorithm. Instead
of vanilla gradient descent, we analyze the mirror descent algorithm of Equation (11), but
now $g_t$ is replaced by $\nabla f^R_t(x_t)$, and we add the composite term $\Psi(x)$.

Theorem 7. Let $f_t(x) = g_t \cdot x$, and let $f^R_t(x) = g_t \cdot x + R_t(x)$, where $R_t$ is a differentiable
convex function. Let $\Psi$ be an arbitrary convex function. Consider the composite-objective
mirror-descent algorithm which plays

$$\hat{x}_{t+1} = \arg\min_x \; \nabla f^R_t(\hat{x}_t) \cdot x + \Psi(x) + B_{1:t}(x, \hat{x}_t), \tag{20}$$

and the FTRL algorithm which plays

$$x_{t+1} = \arg\min_x \; f^R_{1:t}(x) + \phi_{1:t-1} \cdot x + \Psi(x), \tag{21}$$

for $\phi_t \in \partial \Psi(x_{t+1})$ such that $g_{1:t} + \nabla R_{1:t}(x_{t+1}) + \phi_{1:t-1} + \phi_t = 0$. If both algorithms play
$\hat{x}_1 = x_1 = 0$, then they are equivalent, in that $x_t = \hat{x}_t$ for all $t > 0$.

The most important corollary of this result is that it lets us add the adaptive online
gradient descent algorithm to Table 1. It is also instructive to specialize to the simplest case
when $\Psi(x) = 0$ and the regularization is quadratic:

Corollary 8. Let $f_t(x) = g_t \cdot x$ and $f^R_t(x) = g_t \cdot x + \frac{\sigma_t}{2}\|x\|_2^2$. Then the following algorithms
play identical points:

• FTRL, which plays $x_{t+1} = \arg\min_x f^R_{1:t}(x)$.

• Gradient descent on the functions $f^R$ using the step size $\eta_t = \frac{1}{\sigma_{1:t}}$, which plays

$$x_{t+1} = x_t - \eta_t \nabla f^R_t(x_t).$$

• Revisionist constant-step size gradient descent with $\eta_t = \frac{1}{\sigma_{1:t}}$, which plays

$$x_{t+1} = -\eta_t g_{1:t}.$$
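The agreement between the second and third algorithms can be checked directly in one dimension. The sketch below assumes scalar gradients and weights with $x_1 = 0$; the function names and data are illustrative only.

```python
def adaptive_ogd(grads, sigmas):
    """x_{t+1} = x_t - eta_t * (g_t + sigma_t * x_t), with eta_t = 1 / sigma_{1:t}."""
    x, sigma_sum, points = 0.0, 0.0, [0.0]
    for g, sigma in zip(grads, sigmas):
        sigma_sum += sigma
        x = x - (g + sigma * x) / sigma_sum
        points.append(x)
    return points


def revisionist_gd(grads, sigmas):
    """x_{t+1} = -eta_t * g_{1:t}, with eta_t = 1 / sigma_{1:t}."""
    points, g_sum, sigma_sum = [0.0], 0.0, 0.0
    for g, sigma in zip(grads, sigmas):
        g_sum += g
        sigma_sum += sigma
        points.append(-g_sum / sigma_sum)
    return points


# Hypothetical data.
grads = [1.0, -2.0, 0.5, 3.0, -1.25]
sigmas = [1.0, 0.5, 0.5, 2.0, 0.75]
gap = max(abs(a - b) for a, b in zip(adaptive_ogd(grads, sigmas),
                                     revisionist_gd(grads, sigmas)))
print(gap)
```

The identity behind the match is $x_t (1 - \sigma_t/\sigma_{1:t}) = x_t\, \sigma_{1:t-1}/\sigma_{1:t}$: if $x_t = -g_{1:t-1}/\sigma_{1:t-1}$, the adaptive step yields $-g_{1:t}/\sigma_{1:t}$.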

The last equivalence in the corollary follows from deriving the closed form for the point 
played by FTRL. We now proceed to the proof of the general theorem: 
Proof of Theorem 7. The proof is by induction, using the induction hypothesis $\hat{x}_t = x_t$.
The base case for $t = 1$ follows by inspection. Suppose the induction hypothesis holds for $t$;
we will show it also holds for $t + 1$. Again let $r_t = \nabla R_t$ and consider Equation (21). Since
$R_1$ is assumed to be strongly convex, applying Theorem 3 gives us that $x_t$ is the unique
solution to $\nabla f^R_{1:t-1}(x_t) + \phi_{1:t-1} = 0$ and so $g_{1:t-1} + r_{1:t-1}(x_t) + \phi_{1:t-1} = 0$. Then, by the
induction hypothesis,

$$-r_{1:t-1}(\hat{x}_t) = g_{1:t-1} + \phi_{1:t-1}. \tag{22}$$

Now consider Equation (20). Since $R_1$ is strongly convex, $B_{1:t}(x, \hat{x}_t)$ is strongly convex
in its first argument, and so by Theorem 3 we have that $\hat{x}_{t+1}$ and some $\phi'_t \in \partial \Psi(\hat{x}_{t+1})$ are
the unique solution to

$$\nabla f^R_t(\hat{x}_t) + \phi'_t + r_{1:t}(\hat{x}_{t+1}) - r_{1:t}(\hat{x}_t) = 0,$$

since $\nabla_p B_R(p, q) = r(p) - r(q)$. Beginning from this equation,

$$\begin{aligned}
0 &= \nabla f^R_t(\hat{x}_t) + \phi'_t + r_{1:t}(\hat{x}_{t+1}) - r_{1:t}(\hat{x}_t) \\
  &= g_t + r_t(\hat{x}_t) + \phi'_t + r_{1:t}(\hat{x}_{t+1}) - r_{1:t}(\hat{x}_t) \\
  &= g_t + r_{1:t}(\hat{x}_{t+1}) + \phi'_t - r_{1:t-1}(\hat{x}_t) \\
  &= g_t + r_{1:t}(\hat{x}_{t+1}) + \phi'_t + g_{1:t-1} + \phi_{1:t-1} && \text{Eq. (22)} \\
  &= g_{1:t} + r_{1:t}(\hat{x}_{t+1}) + \phi_{1:t-1} + \phi'_t.
\end{aligned}$$

Applying Theorem 3 to Equation (21), $(x_{t+1}, \phi_t)$ are the unique pair such that

$$g_{1:t} + r_{1:t}(x_{t+1}) + \phi_{1:t-1} + \phi_t = 0$$

and $\phi_t \in \partial \Psi(x_{t+1})$, and so we conclude $\hat{x}_{t+1} = x_{t+1}$ and $\phi'_t = \phi_t$. ■



4 Regret Analysis 

In this section, we prove the regret bounds of Theorem 2 and Corollary 1. Recall the general
update we analyze is

$$x_{t+1} = \arg\min_x \; g'_{1:t-1} \cdot x + f_t(x) + \alpha_{1:t}\Psi(x) + R_{1:t}(x) \tag{3}$$

where $g'_t \in \partial f_t(x_{t+1})$. It will be useful to consider the equivalent (by Theorem 3) update

$$x_{t+1} = \arg\min_x \; g'_{1:t} \cdot x + \alpha_{1:t}\Psi(x) + R_{1:t}(x). \tag{23}$$

We can view this alternative update as running FTRL on the linear approximations of $f_t$
taken at $x_{t+1}$,

$$\bar{f}_t(x) = f_t(x_{t+1}) + g'_t \cdot (x - x_{t+1}).$$

To see the equivalence, note the constant terms in $\bar{f}$ change neither the argmin nor regret.
This is still an implicit update, as implementing the update requires an oracle to compute
an appropriate subgradient $g'_t$ (say, by finding $x_{t+1}$ via Equation (3)).

This re-interpretation is essential, as it lets us analyze a follow-the-leader algorithm on
convex functions; note that the objective function of Equation (3) is not the sum of one
convex function per round, as when moving from $x_{t-1}$ to $x_t$ we effectively add
$g'_{t-1} \cdot x - f_{t-1}(x) + f_t(x)$ to the objective, which is not in general convex. By immediately applying
an appropriate linearization of the loss functions, we avoid this non-convexity.

The affine functions $\bar{f}_t$ lower bound $f_t$, and so can be used to lower bound the loss of any
$x$; however, in contrast to the more typical subgradient approximations taken at $x_t$, these
linear functions are not tight at $x_t$, and so our analysis must also account for the additional
loss $f_t(x_t) - \bar{f}_t(x_t)$. Before formalizing these arguments in the proof of Theorem 2, we prove
the following lemma. We will use this lemma to get a tight bound on the regret of the
algorithm against the linearized functions $\bar{f}$, but it is in fact much more general.

Lemma 9 (Strong FTRL Lemma). Let $f_t$ be a sequence of arbitrary (e.g., non-convex) loss
functions, and let $R_t$ be arbitrary non-negative regularization functions. Define
$f^R_t(x) = f_t(x) + R_t(x)$. Then, if we play $x_{t+1} = \arg\min_x f^R_{1:t}(x)$, our regret against the functions $f_t$
versus an arbitrary point $\mathring{x}$ is bounded by

$$\text{Regret} \le R_{1:T}(\mathring{x}) + \sum_{t=1}^{T} \Big( f^R_{1:t}(x_t) - f^R_{1:t}(x_{t+1}) - R_t(x_t) \Big).$$

A weaker (though sometimes easier to use) version of this lemma, stating

$$\text{Regret} \le R_{1:T}(\mathring{x}) + \sum_{t=1}^{T} \big( f_t(x_t) - f_t(x_{t+1}) \big),$$

has been used previously (Kalai and Vempala 2005; Hazan 2008; McMahan and Streeter
2010). In the case of linear functions with quadratic regularization, as in the analysis of
McMahan and Streeter (2010), the weaker version loses a factor of $\frac{1}{2}$ (corresponding to a
$\sqrt{2}$ in the final bound). The key is that in that case, being the leader is strictly better
than playing the post-hoc optimal point. Quantifying this difference leads to the improved
bounds for FTPRL in this paper.
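Both versions of the lemma can be evaluated numerically for linear losses with origin-centered quadratic regularization. The sketch below works in one dimension with $f_t(x) = g_t x$ and $R_t(x) = \frac{\sigma_t}{2}x^2$; the random data, the comparator value, and all function names are assumptions made for illustration.

```python
import random


def ftrl_points(grads, sigmas):
    """FTRL iterates x_{t+1} = argmin_x f^R_{1:t}(x) = -g_{1:t} / sigma_{1:t}, x_1 = 0."""
    points, g_sum, sigma_sum = [0.0], 0.0, 0.0
    for g, sigma in zip(grads, sigmas):
        g_sum += g
        sigma_sum += sigma
        points.append(-g_sum / sigma_sum)
    return points


def cumulative_objective(grads, sigmas, t, x):
    """f^R_{1:t}(x) = g_{1:t} * x + (sigma_{1:t} / 2) * x**2."""
    return sum(grads[:t]) * x + 0.5 * sum(sigmas[:t]) * x * x


def regret_and_bounds(grads, sigmas, comparator):
    xs = ftrl_points(grads, sigmas)  # xs[t - 1] is x_t
    regret = sum(g * (xs[i] - comparator) for i, g in enumerate(grads))
    reg_at_comparator = 0.5 * sum(sigmas) * comparator ** 2
    strong = weak = reg_at_comparator
    for t in range(1, len(grads) + 1):
        x_t, x_next = xs[t - 1], xs[t]
        # Strong version: leader objective gap minus the regularizer at x_t.
        strong += (cumulative_objective(grads, sigmas, t, x_t)
                   - cumulative_objective(grads, sigmas, t, x_next)
                   - 0.5 * sigmas[t - 1] * x_t ** 2)
        # Weak version: per-round loss gap only.
        weak += grads[t - 1] * (x_t - x_next)
    return regret, strong, weak


rng = random.Random(0)
grads = [rng.uniform(-1.0, 1.0) for _ in range(50)]
sigmas = [rng.uniform(0.5, 1.5) for _ in range(50)]
regret, strong, weak = regret_and_bounds(grads, sigmas, comparator=0.3)
print(regret, strong, weak)
```

The ordering regret ≤ strong bound ≤ weak bound holds for any input here: the per-round difference between the two bounds is $f^R_{1:t-1}(x_t) - f^R_{1:t-1}(x_{t+1}) - R_t(x_{t+1}) \le 0$, because $x_t$ minimizes $f^R_{1:t-1}$ and $R_t$ is non-negative.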

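Proof of Lemma 9. First, we consider regret against the functions $f^R$ for not playing $\mathring{x}$:

$$\begin{aligned}
\text{Regret}(f^R) &= \sum_{t=1}^{T} \big( f^R_t(x_t) - f^R_t(\mathring{x}) \big) && \text{by definition} \\
&= \sum_{t=1}^{T} f^R_t(x_t) - f^R_{1:T}(\mathring{x}) \\
&= \sum_{t=1}^{T} \big( f^R_{1:t}(x_t) - f^R_{1:t-1}(x_t) \big) - f^R_{1:T}(\mathring{x}) && \text{where } f^R_{1:0}(x) = 0 \\
&\le \sum_{t=1}^{T} \big( f^R_{1:t}(x_t) - f^R_{1:t-1}(x_t) \big) - f^R_{1:T}(x_{T+1}) && \text{since } x_{T+1} \text{ minimizes } f^R_{1:T} \\
&= \sum_{t=1}^{T} \big( f^R_{1:t}(x_t) - f^R_{1:t}(x_{t+1}) \big),
\end{aligned}$$

where the last line follows by simply re-indexing the $-f^R_{1:t}$ terms. Equivalently, applying the
definitions of regret and $f^R$,

$$\sum_{t=1}^{T} \big( f_t(x_t) + R_t(x_t) \big) - f_{1:T}(\mathring{x}) - R_{1:T}(\mathring{x}) \le \sum_{t=1}^{T} \big( f^R_{1:t}(x_t) - f^R_{1:t}(x_{t+1}) \big).$$

Re-arranging the inequality proves the lemma. ■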



With this lemma in hand, we turn to our main proof. It is worth noting that the second
half of the proof simplifies significantly when we choose $y_t = x_t$, as in FTPRL.

Proof of Theorem 2. Recall $\bar{f}_t(x) = f_t(x_{t+1}) + g'_t \cdot (x - x_{t+1})$, a linear approximation of
$f_t$ taken at the next point, $x_{t+1}$. We can bound the regret of our algorithm (expressed as an
FTRL algorithm on the functions $\bar{f}_t$, Equation (23)) against the functions $\bar{f}_t$ by applying
Lemma 9 to the functions $\bar{f}_t$ with regularization functions $R'_t(x) = R_t(x) + \alpha_t \Psi(x)$. Because
we are taking the linear approximation at $x_{t+1}$ instead of $x_t$, it may be the case that
our actual loss $f_t(x_t)$ on round $t$ is greater than the loss under $\bar{f}_t$, that is we may have
$f_t(x_t) > \bar{f}_t(x_t)$. Thus, we must account for this additional regret. From the definition of
regret we have

$$\begin{aligned}
\text{Regret}(f) &= \text{Regret}(\bar{f}) + \sum_{t=1}^{T} \big( f_t(x_t) - \bar{f}_t(x_t) \big) + \big( \bar{f}_{1:T}(\mathring{x}) - f_{1:T}(\mathring{x}) \big) \\
&\le \text{Regret}(\bar{f}) + \sum_{t=1}^{T} \big( f_t(x_t) - \bar{f}_t(x_t) \big)
\end{aligned}$$

since $\bar{f}_t$ lower bounds $f_t$, and letting $\bar{f}^R_t(x) = \bar{f}_t(x) + R'_t(x)$,

$$\le \underbrace{R'_{1:T}(\mathring{x}) + \sum_{t=1}^{T} \big( \bar{f}^R_{1:t}(x_t) - \bar{f}^R_{1:t}(x_{t+1}) - R'_t(x_t) \big)}_{\text{Lemma 9 on } \bar{f}_t \text{ and } R'_t} + \underbrace{\sum_{t=1}^{T} \big( f_t(x_t) - \bar{f}_t(x_t) \big)}_{\text{Underestimate of real loss at } x_t}.$$

Let $\Delta_t$ be the contribution of the non-regularization terms for a particular $t$,

$$\begin{aligned}
\Delta_t &= \bar{f}^R_{1:t}(x_t) - \bar{f}^R_{1:t}(x_{t+1}) + f_t(x_t) - \bar{f}_t(x_t) \\
&= \bar{f}_{1:t}(x_t) + R'_{1:t}(x_t) - \bar{f}_{1:t}(x_{t+1}) - R'_{1:t}(x_{t+1}) + f_t(x_t) - \bar{f}_t(x_t) \\
&= \bar{f}_{1:t-1}(x_t) + R'_{1:t}(x_t) - \bar{f}_{1:t}(x_{t+1}) - R'_{1:t}(x_{t+1}) + f_t(x_t) \\
&= \big( \bar{f}_{1:t-1}(x_t) + R'_{1:t}(x_t) + f_t(x_t) \big) - \big( \bar{f}_{1:t}(x_{t+1}) + R'_{1:t}(x_{t+1}) \big).
\end{aligned}$$

For the terms containing $x_{t+1}$, using the fact that $f_t(x_{t+1}) = \bar{f}_t(x_{t+1})$, we have

$$\bar{f}_{1:t}(x_{t+1}) + R'_{1:t}(x_{t+1}) = \bar{f}_{1:t-1}(x_{t+1}) + R'_{1:t}(x_{t+1}) + f_t(x_{t+1}). \tag{24}$$

For a fixed $t$, we define two helper functions $h_1$ and $h_2$. Let

$$h_2(x) = \bar{f}_{1:t-1}(x) + R_{1:t}(x) + \alpha_{1:t}\Psi(x) + f_t(x),$$

so $\Delta_t = h_2(x_t) - h_2(x_{t+1})$. Define

$$h_1(x) = \bar{f}_{1:t-1}(x) + R_{1:t-1}(x) + \alpha_{1:t-1}\Psi(x).$$

Then we can write

$$h_2(x) = h_1(x) + f_t(x) + R_t(x) + \alpha_t \Psi(x).$$

By definition of our updates, $x_t = \arg\min_x h_1(x)$ (using Eq. (23)) and $x_{t+1} = \arg\min_x h_2(x)$.

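Now, suppose we choose regularization $R_t(x) = \frac{1}{2}\|Q_t^{1/2}(x - y_t)\|^2$. The remainder of the
proof is accomplished by bounding $h_2(x_t) - h_2(x_{t+1})$, with the aid of two lemmas (stated
and proved below). First, by expanding $h_1$ and dropping constant terms (which cancel from
$\Delta_t$), we have

$$\begin{aligned}
h_1(x) &= \frac{1}{2} x^\top Q_{1:t-1} x + \Big( g'_{1:t-1} - \sum_{s=1}^{t-1} Q_s y_s \Big) \cdot x + \alpha_{1:t-1}\Psi(x) \\
&= \frac{1}{2}\|Q_{1:t-1}^{1/2}(x - x_t)\|^2 + \tilde{\Psi}(x) + k'_t && \text{Lemma 10}
\end{aligned}$$

for some constant $k'_t \in \mathbb{R}$. Recall $Q_{1:t-1}^{1/2} = (Q_1 + \cdots + Q_{t-1})^{1/2}$. Now, we can apply Lemma 11.
The constant $k'_t$ cancels out, and we take $Q_a = Q_{1:t-1}$, $Q_b = Q_t$, $\Phi_a = \tilde{\Psi}$, $\Phi_b = \alpha_t \Psi$, $x_1 = x_t$,
etc. Thus, letting $d_t = y_t - x_t$,

$$\begin{aligned}
\Delta_t &= h_2(x_t) - h_2(x_{t+1}) \\
&\le \big(g_t - \tfrac{1}{2} g'_t\big)^\top Q_{1:t}^{-1} g'_t + \frac{1}{2}\|Q_{1:t}^{-1/2}(Q_t d_t)\|^2 - g_t^\top Q_{1:t}^{-1} Q_t d_t + \alpha_t \Psi(x_t) - \alpha_t \Psi(x_{t+1}).
\end{aligned} \tag{25}$$

We now re-incorporate the $-R'_t(x_t)$ terms not included in the definition of $\Delta_t$. Note
$R'_t(x) \ge R_t(x)$, and $R_t(x_t) = \frac{1}{2}\|Q_t^{1/2} d_t\|^2$. Then

$$\begin{aligned}
\frac{1}{2}\|Q_{1:t}^{-1/2}(Q_t d_t)\|^2 - \frac{1}{2}\|Q_t^{1/2} d_t\|^2 &= \frac{1}{2} d_t^\top Q_t Q_{1:t}^{-1} Q_t d_t - \frac{1}{2} d_t^\top Q_t d_t \\
&\le \frac{1}{2} d_t^\top Q_t Q_t^{-1} Q_t d_t - \frac{1}{2} d_t^\top Q_t d_t = 0,
\end{aligned}$$

where we have used the fact that $Q_{1:t} \succeq Q_t \succ 0$ implies $Q_t^{-1} \succeq Q_{1:t}^{-1} \succ 0$. Combining this
result with Eq. (25) and adding back the $R'_{1:T}(\mathring{x})$ term gives

$$\text{Regret} \le R'_{1:T}(\mathring{x}) + \sum_{t=1}^{T} \Big( \big(g_t - \tfrac{1}{2} g'_t\big)^\top Q_{1:t}^{-1} g'_t - g_t^\top Q_{1:t}^{-1} Q_t (y_t - x_t) + \alpha_t \Psi(x_t) - \alpha_t \Psi(x_{t+1}) \Big).$$

Defining $\alpha_{T+1} = 0$, observe

$$\begin{aligned}
\sum_{t=1}^{T} \alpha_t \Psi(x_t) - \alpha_t \Psi(x_{t+1}) &= \sum_{t=1}^{T} \alpha_t \Psi(x_t) - \alpha_{t+1}\Psi(x_{t+1}) + (\alpha_{t+1} - \alpha_t)\Psi(x_{t+1}) \\
&\le \sum_{t=1}^{T} \alpha_t \Psi(x_t) - \alpha_{t+1}\Psi(x_{t+1}) \\
&= \alpha_1 \Psi(x_1) - \alpha_{T+1}\Psi(x_{T+1}) = 0,
\end{aligned}$$

where the inequality uses the fact that $0 \le \alpha_{t+1} \le \alpha_t$ and $\Psi(x) \ge 0$. The last equality
follows from $\Psi(x_1) = \Psi(0) = 0$ and $\alpha_{T+1} = 0$. Thus we conclude

$$\text{Regret} \le R'_{1:T}(\mathring{x}) + \sum_{t=1}^{T} \big(g_t - \tfrac{1}{2} g'_t\big)^\top Q_{1:t}^{-1} g'_t - g_t^\top Q_{1:t}^{-1} Q_t (y_t - x_t).$$

The second inequality in the theorem statement follows from Equation (28) of Lemma 11.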



Only $R_{1:T}(\mathring{x})$ and the last term in the bound depend on the center of the regularization $y_t$;
the final term can either increase or decrease regret, depending on the relationship between
$g_t$ and $y_t - x_t$ (note $g_t$ is not known when $y_t$ is selected). If we consider the simple case
where all $Q_t = \sigma_t I$, observe that if $-g_t \cdot (y_t - x_t) > 0$ then (roughly speaking) both the new
regularization penalty and the gradient of the loss function are pulling $x_{t+1}$ away from $x_t$
in the same direction, and so regret from this term will be larger.

Proof of Corollary 1. We first consider FTPRL. Let $Q_t = \sigma_t I$, and define $\sigma_t$ such that
$\sigma_{1:t} = G\sqrt{2t}/D$. Then taking Theorem 2 with $x_t = y_t$ gives

$$\begin{aligned}
\text{Regret} &\le \sum_{t=1}^{T} \frac{\sigma_t}{2}\|\mathring{x} - x_t\|^2 + \sum_{t=1}^{T} \frac{\|g_t\|^2}{2\sigma_{1:t}} \\
&\le \frac{GD\sqrt{2T}}{2} + \frac{GD}{2\sqrt{2}} \sum_{t=1}^{T} \frac{1}{\sqrt{t}} \\
&\le DG\sqrt{2T},
\end{aligned}$$

where the last inequality uses the fact that $\sum_{t=1}^{T} \frac{1}{\sqrt{t}} \le 2\sqrt{T}$.
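The arithmetic in this chain is easy to check. The following sketch evaluates the two terms of the FTPRL bound, taking $\|\mathring{x} - x_t\| \le D$ and $\|g_t\| \le G$ at their extreme values, and compares their sum against $DG\sqrt{2T}$; the function name and the sample values of $G$, $D$ and $T$ are illustrative assumptions.

```python
import math


def ftprl_bound_terms(G, D, T):
    """Return the regularization and gradient terms of the FTPRL regret bound.

    Uses sigma_{1:t} = G * sqrt(2 t) / D, with ||x - x_t|| = D and ||g_t|| = G
    plugged in as worst-case values.
    """
    # sum_t (sigma_t / 2) * D**2 = sigma_{1:T} * D**2 / 2
    regularization = G * D * math.sqrt(2 * T) / 2.0
    # sum_t G**2 / (2 * sigma_{1:t}) = G * D / (2 * sqrt(2)) * sum_t 1 / sqrt(t)
    inverse_sqrt_sum = sum(1.0 / math.sqrt(t) for t in range(1, T + 1))
    gradient = G * D / (2.0 * math.sqrt(2.0)) * inverse_sqrt_sum
    return regularization, gradient, inverse_sqrt_sum


for T in (1, 10, 1000):
    reg, grad, inv_sum = ftprl_bound_terms(G=1.5, D=2.0, T=T)
    print(T, inv_sum <= 2.0 * math.sqrt(T), reg + grad <= 1.5 * 2.0 * math.sqrt(2 * T))
```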

Recall the characterization of implicit-update mirror descent from Section 3. Thus, in
this case we have $f^u_t(x) \leftarrow f^w_t(x) + I_{\mathcal{F}}(x)$. Let $g^w_t = \nabla f^w_t(x_t)$, so in the analysis we have
$g_t = g^w_t + \phi_t$. Following standard arguments, e.g. (Bartlett et al. 2007; Duchi et al.
2010b), it is straightforward to use the Pythagorean theorem for Bregman divergences to
show $\frac{1}{2}\|g_t\|^2 \le \frac{1}{2}\|g^w_t\|^2$, and then the result follows as for FTPRL.

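For regularized dual averaging we have $y_t = 0$. Again let $Q_t = \sigma_t I$, and define $\sigma_t$ such
that $\sigma_{1:t} = 2G\sqrt{2t}/D$. Then, Theorem 2 gives

$$\text{Regret} \le \sum_{t=1}^{T} \frac{\sigma_t}{2}\|\mathring{x}\|^2 + \sum_{t=1}^{T} \frac{\|g_t\|^2}{2\sigma_{1:t}} + \sum_{t=1}^{T} \frac{\sigma_t}{\sigma_{1:t}}\, g_t \cdot x_t.$$

The proof is largely similar to that for FTPRL, but we must deal with an extra term. First,
note for $t \ge 2$,

$$\sigma_t = \sigma_{1:t} - \sigma_{1:t-1} = \frac{2G\sqrt{2}}{D}\big(\sqrt{t} - \sqrt{t-1}\big) \le \frac{2G\sqrt{2}}{D}\left(\frac{1}{2\sqrt{t-1}}\right) \le \frac{2G}{D\sqrt{t}},$$

where we have used $\sqrt{t} - \sqrt{t-1} \le \frac{1}{2\sqrt{t-1}}$ and for $t \ge 2$, $1/\sqrt{t-1} \le \sqrt{2}/\sqrt{t}$. Then, noting
the term for $t = 1$ is zero since $x_1 = 0$,

$$\sum_{t=1}^{T} \frac{\sigma_t}{\sigma_{1:t}}\, g_t \cdot x_t \le \frac{GD}{2} \sum_{t=2}^{T} \frac{\sigma_t}{\sigma_{1:t}} \le \frac{GD}{2} \sum_{t=2}^{T} \frac{2G}{D\sqrt{t}} \cdot \frac{D}{2G\sqrt{2t}} \le \frac{GD}{2\sqrt{2}} \sum_{t=2}^{T} \frac{1}{t} \le \frac{GD}{2\sqrt{2}} (\ln T + 1).$$

Applying this observation,

$$\begin{aligned}
\text{Regret} &\le \sum_{t=1}^{T} \frac{\sigma_t}{2}\|\mathring{x}\|^2 + \sum_{t=1}^{T} \frac{\|g_t\|^2}{2\sigma_{1:t}} + \sum_{t=1}^{T} \frac{\sigma_t}{\sigma_{1:t}}\, g_t \cdot x_t \\
&\le \frac{GD\sqrt{2T}}{4} + \frac{GD}{4\sqrt{2}} \sum_{t=1}^{T} \frac{1}{\sqrt{t}} + \frac{GD}{2\sqrt{2}} \ln T + O(1) \\
&\le \frac{1}{2} DG\sqrt{2T} + \frac{GD}{2\sqrt{2}} \ln T + O(1).
\end{aligned}$$

We now prove the two lemmas used in bounding the $h_2(x_t) - h_2(x_{t+1})$ terms in the proof
of Theorem 2.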

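Lemma 10. Let $\Psi$ be a convex function defined on $\mathbb{R}^n$, and let $Q \in S^n_{++}$. Define

$$h(x) = \frac{1}{2} x^\top Q x + b \cdot x + \Psi(x),$$

and let $x^* = \arg\min_x h(x)$. Then, we can rewrite $h$ as

$$h(x) = \frac{1}{2}\|Q^{1/2}(x - x^*)\|^2 + \tilde{\Psi}(x) + k,$$

where $k \in \mathbb{R}$ and $\tilde{\Psi}$ is convex with $0 \in \partial \tilde{\Psi}(x^*)$.

Proof. Since $Q \in S^n_{++}$, the function $\frac{1}{2} x^\top Q x$ is strongly convex, and so using Theorem 3, $h$
has a unique minimizer $x^*$ and there exists a (unique) $\phi$ such that

$$Q x^* + b + \phi = 0 \tag{26}$$

with $\phi \in \partial \Psi(x^*)$. Define $\tilde{\Psi}(x) = \Psi(x) - \phi \cdot x$, and note $0 \in \partial \tilde{\Psi}(x^*)$. Then,

$$\begin{aligned}
h(x) &= \frac{1}{2} x^\top Q x + b \cdot x + \Psi(x) \\
&= \frac{1}{2} x^\top Q x + (b + \phi) \cdot x + \tilde{\Psi}(x) && \text{Defn. } \tilde{\Psi}(x) \\
&= \frac{1}{2} x^\top Q x - x^\top Q x^* + \tilde{\Psi}(x) && \text{Eq. (26)} \\
&= \frac{1}{2}\|Q^{1/2}(x - x^*)\|^2 + \tilde{\Psi}(x) - \frac{1}{2}\|Q^{1/2} x^*\|^2,
\end{aligned}$$

where $\tilde{\Psi}$ and $k = -\frac{1}{2}\|Q^{1/2} x^*\|^2$ satisfy the requirements of the theorem. □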
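The rewriting in Lemma 10 can be verified numerically for a concrete case. The sketch below takes one dimension with $Q = q > 0$ and $\Psi(x) = \lambda|x|$, for which the minimizer is a soft-thresholded point; the function names and parameter values are assumptions made for the example.

```python
def soft_threshold(u, tau):
    """Shrink u toward zero by tau, returning exactly 0.0 when |u| <= tau."""
    if u > tau:
        return u - tau
    if u < -tau:
        return u + tau
    return 0.0


def lemma10_pieces(q, b, lam):
    """Return (x_star, phi, k) for h(x) = q*x**2/2 + b*x + lam*|x|."""
    x_star = soft_threshold(-b / q, lam / q)  # minimizer of h
    phi = -(q * x_star + b)                   # subgradient of lam*|x| at x_star, Eq. (26)
    k = -0.5 * q * x_star ** 2
    return x_star, phi, k


def h(x, q, b, lam):
    return 0.5 * q * x * x + b * x + lam * abs(x)


def h_rewritten(x, q, b, lam):
    """The form (q/2)*(x - x_star)**2 + psi_tilde(x) + k from the lemma."""
    x_star, phi, k = lemma10_pieces(q, b, lam)
    psi_tilde = lam * abs(x) - phi * x
    return 0.5 * q * (x - x_star) ** 2 + psi_tilde + k


grid = [i / 10.0 for i in range(-30, 31)]
# Two hypothetical settings: one with a non-zero minimizer, one with x_star = 0.
for q, b, lam in [(2.0, -1.0, 0.3), (2.0, 0.2, 0.3)]:
    worst = max(abs(h(x, q, b, lam) - h_rewritten(x, q, b, lam)) for x in grid)
    print(lemma10_pieces(q, b, lam), worst)
```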

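Lemma 11. Let $x_1 \in \mathbb{R}^n$, let $\Phi_a$ be a convex function such that $0 \in \partial \Phi_a(x_1)$, and let
$Q_a \in S^n_{++}$. Define

$$h_1(x) = \frac{1}{2}\|Q_a^{1/2}(x - x_1)\|^2 + \Phi_a(x),$$

so $x_1 = \arg\min_x h_1(x)$. Let $f$ and $\Phi_b$ be convex functions, let $Q_b \in S^n_{+}$, and define

$$h_2(x) = h_1(x) + f(x) + \frac{1}{2}\|Q_b^{1/2}(x - y)\|^2 + \Phi_b(x).$$

Let $x_2 = \arg\min_x h_2(x)$, let $g \in \partial f(x_1)$, let $d = y - x_1$, and let $Q_{a:b} = Q_a + Q_b$. Then, there
exists a certain subgradient $g'$ of $f$ such that

$$h_2(x_1) - h_2(x_2) \le \big(g - \tfrac{1}{2} g'\big)^\top Q_{a:b}^{-1} g' + \frac{1}{2}\|Q_{a:b}^{-1/2}(Q_b d)\|^2 - g^\top Q_{a:b}^{-1} Q_b d + \Phi_b(x_1) - \Phi_b(x_2). \tag{27}$$

Further,

$$\big(g - \tfrac{1}{2} g'\big)^\top Q_{a:b}^{-1} g' \le \frac{1}{2} g^\top Q_{a:b}^{-1} g - \delta \tag{28}$$

where $\delta \ge 0$.

As we will see in the proof, $\delta > 0$ when the implicit update is non-trivial.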

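Proof. To obtain these bounds, we first analyze the problem without the $\Phi$ terms. For this
purpose, we define

$$\hat{h}_2(x) = \frac{1}{2}\|Q_a^{1/2}(x - x_1)\|^2 + \frac{1}{2}\|Q_b^{1/2}(x - y)\|^2 + f(x),$$

and let $\hat{x}_2 = \arg\min_x \hat{h}_2(x)$. We can re-write

$$\begin{aligned}
\hat{h}_2(x) &= f(x) + \frac{1}{2}\|Q_a^{1/2}(x - x_1)\|^2 + \frac{1}{2}\|Q_b^{1/2}(x - x_1 - d)\|^2 \\
&= f(x) + \frac{1}{2}\|Q_{a:b}^{1/2}(x - x_1)\|^2 - d^\top Q_b (x - x_1) + \frac{1}{2}\|Q_b^{1/2} d\|^2.
\end{aligned}$$

Then, using Theorem 3 on the last expression, there exists a $g' \in \partial f(\hat{x}_2)$ such that
$g' + Q_{a:b}(\hat{x}_2 - x_1) - Q_b d = 0$, and so in particular

$$\hat{x}_2 - x_1 = Q_{a:b}^{-1}(Q_b d - g'). \tag{29}$$

Then,

$$\begin{aligned}
\hat{h}_2(x_1) - \hat{h}_2(\hat{x}_2)
&= f(x_1) + \frac{1}{2}\|Q_b^{1/2} d\|^2 - f(\hat{x}_2) - \frac{1}{2}\|Q_{a:b}^{1/2}(\hat{x}_2 - x_1)\|^2 + d^\top Q_b(\hat{x}_2 - x_1) - \frac{1}{2}\|Q_b^{1/2} d\|^2 \\
&= f(x_1) - f(\hat{x}_2) - \frac{1}{2}\|Q_{a:b}^{1/2}(\hat{x}_2 - x_1)\|^2 + d^\top Q_b(\hat{x}_2 - x_1),
\end{aligned}$$

and since $f(\hat{x}_2) \ge f(x_1) + g \cdot (\hat{x}_2 - x_1)$ implies $f(x_1) - f(\hat{x}_2) \le -g \cdot (\hat{x}_2 - x_1)$,

$$\le -\frac{1}{2}\|Q_{a:b}^{1/2}(\hat{x}_2 - x_1)\|^2 + (Q_b d - g)^\top (\hat{x}_2 - x_1)$$

and applying Eq. (29),

$$\begin{aligned}
&= -\frac{1}{2}\|Q_{a:b}^{-1/2}(Q_b d - g')\|^2 + (Q_b d - g)^\top \big( Q_{a:b}^{-1}(Q_b d - g') \big) \\
&= g^\top Q_{a:b}^{-1} g' - \frac{1}{2}\|Q_{a:b}^{-1/2} g'\|^2 + \frac{1}{2}\|Q_{a:b}^{-1/2}(Q_b d)\|^2 - g^\top Q_{a:b}^{-1} Q_b d,
\end{aligned}$$

and so we conclude

$$\hat{h}_2(x_1) - \hat{h}_2(\hat{x}_2) \le \big(g - \tfrac{1}{2} g'\big)^\top Q_{a:b}^{-1} g' + \frac{1}{2}\|Q_{a:b}^{-1/2}(Q_b d)\|^2 - g^\top Q_{a:b}^{-1} Q_b d. \tag{30}$$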

Next, we quantify the advantage offered by implicit updates. Suppose we choose $\tilde{x}_2$ by
optimizing a version of $\hat{h}_2$ where $f$ is linearized at $x_1$:

$$\tilde{h}_2(x) = \frac{1}{2}\|Q_a^{1/2}(x - x_1)\|^2 + \frac{1}{2}\|Q_b^{1/2}(x - y)\|^2 + g \cdot x.$$

Let $\tilde{x}_2 = \arg\min_x \tilde{h}_2(x)$. We say the implicit update is non-trivial when $\hat{h}_2(\hat{x}_2) < \hat{h}_2(\tilde{x}_2)$,
that is, the implicit update provides a better solution to the optimization problem defined
by $\hat{h}_2$. By definition $\hat{h}_2(\hat{x}_2) \le \hat{h}_2(\tilde{x}_2)$, and we can write $\hat{h}_2(\hat{x}_2) = \hat{h}_2(\tilde{x}_2) - 2\delta$ with $\delta \ge 0$.
Let $R_1(x) = \frac{1}{2}\|Q_a^{1/2}(x - x_1)\|^2$ and $R_2(x) = \frac{1}{2}\|Q_b^{1/2}(x - y)\|^2$. Then, by the definition of $\tilde{x}_2$
and $\hat{x}_2$ we have

$$\begin{aligned}
R_{1:2}(\tilde{x}_2) + g \cdot \tilde{x}_2 &\le R_{1:2}(\hat{x}_2) + g \cdot \hat{x}_2 \\
R_{1:2}(\hat{x}_2) + g' \cdot \hat{x}_2 &= R_{1:2}(\tilde{x}_2) + g' \cdot \tilde{x}_2 - 2\delta
\end{aligned}$$

and adding and canceling terms common to both sides gives

$$g \cdot \tilde{x}_2 + g' \cdot \hat{x}_2 \le g \cdot \hat{x}_2 + g' \cdot \tilde{x}_2 - 2\delta. \tag{31}$$

Following Equation (29), $\hat{x}_2 = Q_{a:b}^{-1}(Q_b d - g') + x_1$ or $\hat{x}_2 = -Q_{a:b}^{-1} g' + \kappa$ where $\kappa = Q_{a:b}^{-1} Q_b d + x_1$.
Similarly, $\tilde{x}_2 = -Q_{a:b}^{-1} g + \kappa$. Plugging into Equation (31), and noting the $\kappa$ terms cancel, we
have

$$-g^\top Q_{a:b}^{-1} g - g'^\top Q_{a:b}^{-1} g' \le -g^\top Q_{a:b}^{-1} g' - g'^\top Q_{a:b}^{-1} g - 2\delta$$

or re-arranging and dividing by two,

$$\frac{1}{2} g^\top Q_{a:b}^{-1} g - \delta \ge \big(g - \tfrac{1}{2} g'\big)^\top Q_{a:b}^{-1} g'.$$
We now consider the functions that include the $\Phi$ terms. Note

$$h_2(x_2) = \hat{h}_2(x_2) + \Phi_a(x_2) + \Phi_b(x_2) \ge \hat{h}_2(\hat{x}_2) + \Phi_a(x_1) + \Phi_b(x_2).$$

Then,

$$\begin{aligned}
h_2(x_1) - h_2(x_2) &= \hat{h}_2(x_1) + \Phi_a(x_1) + \Phi_b(x_1) - h_2(x_2) \\
&\le \hat{h}_2(x_1) + \Phi_a(x_1) + \Phi_b(x_1) - \hat{h}_2(\hat{x}_2) - \Phi_a(x_1) - \Phi_b(x_2) \\
&= \hat{h}_2(x_1) - \hat{h}_2(\hat{x}_2) + \Phi_b(x_1) - \Phi_b(x_2).
\end{aligned} \tag{32}$$

Combining this fact with Equations (30) and (32) proves the lemma. □



5 Experiments with $L_1$ Regularization

We compare FOBOS, FTRL-Proximal, and RDA on a variety of datasets to illustrate the
key differences between the algorithms, from the point of view of introducing sparsity with
$L_1$ regularization. In all experiments we optimize log-loss (see Section 2). Since our goal
here is to show the impact of the different choices of regularization and the handling of the
$L_1$ penalty, for simplicity we use first-order updates rather than implicit updates for the
log-loss term.

For an experimental evaluation of implicit updates, we refer the reader to Karampatziakis
and Langford (2010), which provides a convincing demonstration of the advantages of
implicit updates on both importance weighted and standard learning problems.

Binary Classification  We compare FTRL-Proximal, RDA, and FOBOS on several public
datasets. We used four sentiment classification data sets (Books, Dvd, Electronics, and
Kitchen), available from (Dredze 2010), each with 1000 positive examples and 1000 negative
examples,⁵ as well as the scaled versions of the rcv1.binary (20,242 examples) and
news20.binary (19,996 examples) data sets from LIBSVM (Chang and Lin 2010).

Table 2: AUC (area under the ROC curve) for online predictions and sparsity in parentheses.
The best value for each dataset is shown in bold. For these experiments, $\lambda$ was fixed at
$0.05/T$.

Data              FTRL-Proximal    RDA              FOBOS
books             0.874 (0.081)    0.878 (0.079)    0.877 (0.382)
dvd               0.884 (0.078)    0.886 (0.075)    0.887 (0.354)
electronics       0.916 (0.114)    0.919 (0.113)    0.918 (0.399)
kitchen           0.931 (0.129)    0.934 (0.130)    0.933 (0.414)
news              0.989 (0.052)    0.991 (0.054)    0.990 (0.194)
rcv1              0.991 (0.319)    0.991 (0.360)    0.991 (0.488)
web search ads    0.832 (0.615)    0.831 (0.632)    0.832 (0.849)

All our algorithms use a learning rate scaling parameter $\gamma$ (see Section 2). The optimal
choice of this parameter can vary somewhat from dataset to dataset, and for different settings
of the $L_1$ regularization strength $\lambda$. For these experiments, we first selected the best $\gamma$ for
each (dataset, algorithm, $\lambda$) combination on a random shuffling of the dataset. We did this
by training a model using each possible setting of $\gamma$ from a reasonable grid (12 points in the
range [0.3, 1.9]), and choosing the $\gamma$ with the highest online AUC. We then fixed this value,
and report the average AUC over 5 different shufflings of each dataset. We chose the area
under the ROC curve (AUC) as our accuracy metric as we found it to be more stable and
have less variance than the mistake fraction. However, results for classification accuracy
were qualitatively very similar.



Ranking Search Ads by Click-Through-Rate  We collected a dataset of about 1,000,000
search ad impressions from a large search engine,⁶ corresponding to ads shown on a small
set of search queries. We formed examples with a feature vector $\theta_t$ for each ad impression,
using features based on the text of the ad and the query, as well as where on the page the
ad showed. The target label $y_t$ is 1 if the ad was clicked, and -1 otherwise.

Smaller learning rates worked better on this dataset; for each (algorithm, $\lambda$) combination
we chose the best $\gamma$ from 9 points in the range [0.03, 0.20]. Rather than shuffling, we report
results for a single pass over the data using the best $\gamma$, processing the events in the order
the queries actually occurred. We also set a lower bound for the stabilizing terms $\sigma_t$ of 20.0
(corresponding to a maximum learning rate of 0.05), as we found this improved accuracy
somewhat. Again, qualitative results did not depend on this choice.

Results. Table 2 reports AUC accuracy (larger numbers are better), followed by the density
of the final predictor $x_T$ (number of non-zeros divided by the total number of features present
in the training data). We measured accuracy online, recording a prediction for each example
before training on it, and then computing the AUC for this set of predictions. For these
experiments, we fixed $\lambda = 0.05/T$ (where $T$ is the number of examples in the dataset),

5 We used the features provided in processed_acl.tar.gz, and scaled each vector of counts to unit length. 
6 While we report results on a single dataset, we repeated the experiments on two others, producing 
qualitatively the same results. No user-specific data was used in these experiments.

Figure 1: Sparsity versus accuracy tradeoffs on the 20 newsgroups dataset. Sparsity increases
on the y-axis, and AUC increases on the x-axis, so the top right corner gets the best
of both worlds. FOBOS is Pareto-dominated by FTRL-Proximal and RDA.

[Plot for Figure 2: web search ads. x-axis: AUC, with ticks from 0.8295 to 0.8330.]



Figure 2: The same comparison as the previous figure, but on a large search ads ranking 
dataset. On this dataset, FTRL-Proximal significantly outperforms both other algorithms. 



which was sufficient to introduce non-trivial sparsity. Overall, there is very little difference
between the algorithms in terms of accuracy, with RDA having a slight edge for these choices
of $\lambda$. Our main point concerns the sparsity numbers. It has been shown before that RDA
outperforms FOBOS in terms of sparsity. The question then is how FTRL-Proximal
performs, as it is a hybrid of the two, selecting additional stabilization $R_t$ in the manner of
FOBOS, but handling the $L_1$ regularization in the manner of RDA. These results make it
very clear: it is the treatment of $L_1$ regularization that makes the key difference for sparsity,
as FTRL-Proximal behaves very comparably to RDA in this regard.
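
To make "handling the $L_1$ regularization in the manner of RDA" concrete, the sketch below solves a one-coordinate problem in closed form. It assumes the update minimizes a linear term `z * x`, a cumulative penalty `cum_l1 * |x|`, and a quadratic stabilizer `(s / 2) * x**2`; for a proximal stabilizer centered at earlier points, completing the square gives the same form with a shifted `z`. For contrast, `fobos_step` is the standard per-round soft-thresholding step of forward-backward splitting, which applies only the current round's penalty. The function and argument names are illustrative assumptions, not notation from this paper.

```python
def cumulative_l1_update(z, cum_l1, s):
    """Minimize z*x + cum_l1*|x| + (s/2)*x**2 over real x, assuming s > 0.

    The whole cumulative L1 penalty enters the threshold at once, so the
    coordinate is exactly zero whenever |z| <= cum_l1.
    """
    if abs(z) <= cum_l1:
        return 0.0
    sign = 1.0 if z > 0 else -1.0
    return -(z - sign * cum_l1) / s


def fobos_step(x, g, eta, lam):
    """Minimize g*v + lam*|v| + (v - x)**2 / (2*eta) over real v.

    This is a gradient step followed by soft-thresholding with the
    single-round threshold eta * lam.
    """
    v = x - eta * g
    shrunk = max(abs(v) - eta * lam, 0.0)
    if shrunk == 0.0:
        return 0.0
    return shrunk if v > 0 else -shrunk
```

In the first function the threshold grows with the total penalty accumulated so far, while in the second it depends only on the current step; this is one way to see why the two treatments can yield different amounts of sparsity.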

Fixing a particular value of $\lambda$, however, does not tell the whole story. For all these
algorithms, one can trade off accuracy to get more sparsity by increasing the $\lambda$ parameter.
The best choice of this parameter depends on the application as well as the dataset. For
example, if storing the model on an embedded device with expensive memory, sparsity might
be relatively more important. To show how these algorithms allow different tradeoffs, we
plot sparsity versus AUC for the different algorithms over a range of $\lambda$ values. Figure 1
shows the tradeoffs for the 20 newsgroups dataset, and Figure 2 shows the tradeoffs for web
search ads.

In all cases, FOBOS is Pareto-dominated by RDA and FTRL-Proximal. These two algorithms
are almost indistinguishable in their tradeoff curves on the newsgroups dataset,
but on the ads dataset FTRL-Proximal significantly outperforms RDA as well (footnote 7).
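
Pareto dominance on a sparsity-versus-AUC plot can be checked mechanically. The sketch below assumes each operating point is a pair `(auc, sparsity)` in which larger is better on both axes, matching the orientation of the figures; the function names are illustrative.

```python
def dominates(a, b):
    """True if point a is at least as good as b on both axes and strictly
    better on at least one. Points are (auc, sparsity), larger is better."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def pareto_front(points):
    """Return the points that no other point in the list dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Made-up operating points for illustration only; these are not results
# from the experiments.
example_points = [(0.80, 0.90), (0.82, 0.70), (0.79, 0.60)]
front = pareto_front(example_points)
```

An algorithm is Pareto-dominated in the sense used here when none of its operating points survive on the front once the other algorithms' points are included.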

6 Conclusions and Open Questions 

The goal of this work has been to extend the theoretical understanding of several families of 
algorithms that have shown significant applied success for large-scale learning problems. We 
have shown that the most commonly used versions of mirror descent, FTRL-Proximal, and
RDA are closely related, and provided evidence that the non-smooth regularization $\Psi$ is best
handled globally, via RDA or FTRL-Proximal. Our analysis also extends these algorithms
to implicit updates, which can offer significantly improved performance for some problems, 
including applications in active learning and importance-weighted learning. 

Significant open questions remain. The observation that FOBOS is using a subgradient 
approximation for much of the cumulative $L_1$ penalty while RDA and FTRL-Proximal
handle it exactly provides a compelling explanation for the improved sparsity produced 
by the latter two algorithms. Nevertheless, this is not a proof that these two algorithms 
always produce more sparsity. Quantitative bounds on sparsity have proved theoretically 
very challenging, and any additional results in this direction would be of great interest. 

Similar challenges exist with quantifying the advantage offered by implicit updates. Our 
bounds demonstrate, essentially, a one-step advantage for implicit updates: on any given 
update, the implicit update will increase the regret bound by no more than the explicit 
linearized update, and the inequality will be strict whenever the implicit update is non-trivial.
However, this is insufficient to say that for any given learning problem implicit
updates will offer a better bound. After one update, the explicit and implicit algorithms
will be at different feasible points $x_{t+1}$, which means that they will suffer different losses
under $f_{t+1}$ and (more importantly) compute and store different gradients for that function.
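
The difference between an explicit (linearized) step and an implicit step is easiest to see in one dimension. The sketch below is a toy under stated assumptions, not the general update analyzed in this paper: it uses the squared loss `0.5 * (a*x - b)**2` and a plain quadratic proximity term with step size `eta`, for which the implicit step has a closed form. All names are illustrative.

```python
def explicit_step(x, a, b, eta):
    """Gradient step on f(v) = 0.5*(a*v - b)**2, using the gradient at x."""
    grad = a * (a * x - b)
    return x - eta * grad


def implicit_step(x, a, b, eta):
    """Exact minimizer over v of 0.5*(a*v - b)**2 + (v - x)**2 / (2*eta).

    Setting the derivative a*(a*v - b) + (v - x)/eta to zero and solving
    for v gives the expression below.
    """
    return (x + eta * a * b) / (1.0 + eta * a * a)


def gradient(x, a, b):
    """Gradient of 0.5*(a*x - b)**2 at x."""
    return a * (a * x - b)
```

With `x = 0`, `a = b = 1`, and `eta = 3`, the explicit step lands at 3.0, overshooting the minimizer at 1, while the implicit step lands at 0.75. The two iterates then see gradients of 2.0 and -0.25 respectively on the same loss, which illustrates why the gradient sequences of the two algorithms diverge after a single update.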

This issue is not unique to implicit updates: anytime the real loss functions $f_t$ are
non-linear, but the algorithm approximates them by computing $g_t = \nabla f_t(x_t)$, two different
first-order algorithms may see a different sequence of $g_t$'s; since tight regret bounds depend
on this sequence, the bounds will not be directly comparable. Generally we assume the
gradients are bounded, $\|g_t\| \le G$, which leads to bounds like $O(G\sqrt{T})$, but since a large
number of algorithms obtain this bound, it cannot be used to discriminate between them.
Developing finer-grained techniques that can accurately compare the performance of different 
first-order online algorithms on non-linear functions could be of great practical interest to 
the learning community since the loss functions used are almost never linear. 

Acknowledgments 

The author wishes to thank Matt Streeter for numerous helpful discussions and comments, 
and Fernando Pereira for a conversation that helped focus this work on the choice $\Psi(x) =$

7 The improvement is more significant than it first appears. A simple model with only features based on 
where the ads were shown achieves an AUC of nearly 0.80, and the inherent uncertainty in the clicks means 
that even predicting perfect probabilities would produce an AUC significantly less than 1.0, perhaps 0.85. 






References 

Jacob Abernethy, Peter L. Bartlett, Alexander Rakhlin, and Ambuj Tewari. Optimal strategies
and minimax lower bounds for online convex games. In COLT, 2008.

Peter L. Bartlett, Elad Hazan, and Alexander Rakhlin. Adaptive online gradient descent. 
In NIPS, 2007. 

Alina Beygelzimer, Daniel Hsu, John Langford, and Tong Zhang. Agnostic active learning
without constraints. In NIPS, 2010.

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 
2004. 



Chih-Chung Chang and Chih-Jen Lin. LIBSVM data sets. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/, 2010.



Chuong B. Do, Quoc V. Le, and Chuan-Sheng Foo. Proximal regularization for online and 
batch learning. In ICML, 2009. 



Mark Dredze. Multi-domain sentiment dataset (v2.0). http://www.cs.jhu.edu/~mdredze/datasets/sentiment/, 2010.



John Duchi and Yoram Singer. Efficient learning using forward-backward splitting. In NIPS,
2009.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online 
learning and stochastic optimization. In COLT, 2010a. 

John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Ambuj Tewari. Composite objective 
mirror descent. In COLT, 2010b. 

Elad Hazan. Extracting certainty from uncertainty: Regret bounded by variation in costs. 
In COLT, 2008. 

Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. On the duality of strong convexity
and strong smoothness: Learning applications and matrix regularization. 2009.

Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal
of Computer and System Sciences, 71(3), 2005.



Nikos Karampatziakis and John Langford. Importance weight aware gradient updates.
http://arxiv.org/abs/1011.1576, 2010.



Jyrki Kivinen and Manfred Warmuth. Exponentiated Gradient Versus Gradient Descent for
Linear Predictors. Journal of Information and Computation, 132, 1997.

Jyrki Kivinen, Manfred Warmuth, and Babak Hassibi. The p-norm generalization of the LMS
algorithm for adaptive filtering. IEEE Transactions on Signal Processing, 54(5), 2006.

Brian Kulis and Peter Bartlett. Implicit online learning. In ICML, 2010. 






Su-In Lee, Honglak Lee, Pieter Abbeel, and Andrew Y. Ng. Efficient L1 regularized logistic
regression. In AAAI, 2006.

H. Brendan McMahan. Follow-the-Regularized-Leader and Mirror Descent:
Equivalence Theorems and L1 Regularization. Submitted, 2010.

H. Brendan McMahan and Matthew Streeter. Adaptive bound optimization for online 
convex optimization. In COLT, 2010. 

Shai Shalev-Shwartz and Sham M. Kakade. Mind the duality gap: Logarithmic regret 
algorithms for online optimization. In NIPS, pages 1457-1464, 2008. 

Shai Shalev-Shwartz and Yoram Singer. Convex repeated games and Fenchel duality. In
NIPS, 2006.



Matthew J. Streeter and H. Brendan McMahan. Less regret via online conditioning.
http://arxiv.org/abs/1002.4862, 2010.

Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul Bunau, and 
Motoaki Kawanabe. Direct importance estimation for covariate shift adaptation. Annals 
of the Institute of Statistical Mathematics, 60(4), 2008. 

Lin Xiao. Dual averaging method for regularized stochastic learning and online optimization. 
In NIPS, 2009. 

Lin Xiao. Dual averaging methods for regularized stochastic learning and online optimization.
Journal of Machine Learning Research, 11, 2010.

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. 
In ICML, 2003. 

Martin Zinkevich. Theoretical guarantees for algorithms in multi-agent settings. PhD thesis, 
Pittsburgh, PA, USA, 2004. 






